Why Error Handling Is Architecture, Not Just Code

You call a payment API. It returns "timeout."

Did it succeed? Should you retry? Or check first? The check API also timed out. Now what?

The code doesn't tell you. Because the error handling is just log.Error(err); return err. The log gets another line, the caller gets an error — what this error means and what to do next, nobody knows. The log was written for whoever debugs three months later, but three months later, that person can't determine from the log whether the money was actually deducted.

Many systems don't fail because the happy path is poorly written. They fail because the failure paths are like shacks thrown together at the last minute — things that should retry don't, things that should compensate don't, things that should alert stay silent, things that should stop keep going. Each error handling decision looks reasonable in isolation — "log a line, return the error" — but ten such decisions stitched together produce a system whose failure behavior is emergent, not designed.

Error handling isn't what you write after if err != nil. It's a system's ability to remain predictable, recoverable, and explainable when facing failure. This is an architectural decision.

Errors Aren't Just Two Kinds

Most programming languages model errors as a single value — in Go it's an error interface, in Java an Exception object, in Python a raised exception. This modeling creates an illusion: errors are one of two states — "succeeded" or "failed" — and failure is failure, no distinction.

But failure in distributed systems isn't binary.

"Timeout" isn't failure — it's "don't know if it succeeded." The downstream might have finished processing but the response didn't come back; or it might never have received the request. Retry? Might cause a double charge. Don't retry? The payment might not have gone through. "Don't know" is a third state, but our error model crams it into the same error as "connection refused."

"Partial success" isn't failure either. Batch processing 100 items — 23 succeeded, 77 skipped due to primary key conflicts, 0 actually failed. Is this success or failure? Does the caller get an error or nil?

Then there's "degraded but available" — primary path is down, the fallback is carrying the load, functionality is reduced but the system is alive. Return an error to the caller? If nil, the caller thinks everything's fine; if error, the caller might trigger unnecessary alerts and rollbacks.

A single error string carries all these possibilities but doesn't structurally express them. What developers do at each catch block or if err != nil — retry? degrade? alert? ignore? — depends not on system-level design but on personal experience and mood.

Error handling strategy shouldn't be a developer's on-the-spot improvisation — it should be conscious system-level design. When you write return err, you're making an architectural decision — determining this error's propagation scope, recovery strategy, and observability. You just didn't realize you were making this decision.
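One way to make these extra states structural is to return a classified result instead of a bare error. The following is a minimal sketch, not a prescribed design: the names Outcome, Result, and NextStep are hypothetical, and the suggested next steps are illustrative only.

```go
package main

import "fmt"

// Outcome is a hypothetical classification of how a call ended.
type Outcome int

const (
	Succeeded Outcome = iota // the operation definitely took effect
	Failed                   // the operation definitely did not take effect
	Unknown                  // e.g. timeout: it may or may not have taken effect
	Partial                  // some items applied, others skipped
	Degraded                 // served by a fallback with reduced functionality
)

// Result carries the outcome alongside the underlying error, if any.
type Result struct {
	Outcome Outcome
	Applied int // items applied (meaningful for Partial)
	Skipped int // items skipped (meaningful for Partial)
	Err     error
}

// NextStep maps each outcome to an illustrative follow-up action,
// so the decision is made once here rather than at every call site.
func (r Result) NextStep() string {
	switch r.Outcome {
	case Succeeded:
		return "continue"
	case Failed:
		return "report failure"
	case Unknown:
		return "query status or retry with the same idempotency key"
	case Partial:
		return fmt.Sprintf("report %d applied, %d skipped", r.Applied, r.Skipped)
	case Degraded:
		return "continue and record a degradation metric"
	default:
		return "escalate"
	}
}
```

With this shape, the batch example above would be reported as Result{Outcome: Partial, Applied: 23, Skipped: 77} rather than forced into either nil or an error.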

Four Questions That Must Be Answered

For every critical operation in a system, four questions need answering at the error handling level. Not code-level questions — design-level.

First: Retryability. Is retrying this operation safe after failure?

Not all failures warrant retrying. A timeout can be safe to retry: you don't know if it succeeded, but if the downstream API guarantees idempotency, a repeated request does no harm. Connection refused is usually safe too: the request never went out, so retrying won't cause side effects. But "insufficient balance" doesn't benefit from retrying, unless the user topped up in the meantime. "Order not found" doesn't either; data doesn't materialize on its own.

Distinguishing transient from permanent failures isn't a judgment to make at each call site. It's the interface designer's job to classify when defining error codes or error types. Callers shouldn't need to understand business details to know whether to retry — this is part of the error contract.

Second: Exposability. Who should see this error? See what?

Show the user "network timeout, please try again later." Log the full request parameters, trace ID, downstream response body. Expose only error codes and messages to callers — not internal call chains. When a third-party API is down, your API should return "service temporarily unavailable," not "calling third-party API X returned connection refused." The latter isn't just unhelpful — it leaks your system topology.

Translating and redacting error messages step by step isn't a security team's post-hoc checklist. It should be decided at every service boundary when the external interface is designed. An API's error response schema and its success response schema are two halves of the same contract; defining only the success structure leaves the contract incomplete.

Third: Data consistency. After failure, what state is the data in?

DB write succeeded, message publish failed — the order is "paid" in the DB but no notification went out. Points deducted successfully, points ledger entry failed — points are deducted but the ledger doesn't have this record. When audit finds the discrepancy, you can't explain it.

This isn't "should we use transactions." A single atomic transaction that spans a database, a message broker, and another service is rarely available in practice. The question is: in this partial-failure scenario, can the system recover to a consistent state through retry, compensation, or async reconciliation? If not, the problem isn't that the error handling was written badly; the operation itself lacks fault tolerance and needs redesigning.

Many data inconsistency bugs trace back not to code errors but to operations split too fine or too coarse, with no recovery path on failure. This should be visible at design time.
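As a small illustration of the async reconciliation path, a periodic job can compare the two sides and list what is out of step. The sketch below assumes a hypothetical Order type and a set of order IDs for which a notification was recorded; how those are loaded from real storage is out of scope.

```go
package main

import "sort"

// Order is a hypothetical row from the orders table.
type Order struct {
	ID     string
	Status string
}

// MissingNotifications returns the IDs of paid orders that have no recorded
// notification, sorted for stable output. A reconciliation job could
// re-publish the message for each returned ID.
func MissingNotifications(orders []Order, notified map[string]bool) []string {
	var missing []string
	for _, o := range orders {
		if o.Status == "paid" && !notified[o.ID] {
			missing = append(missing, o.ID)
		}
	}
	sort.Strings(missing)
	return missing
}
```

Re-publishing is only safe if the consumer deduplicates, which ties this recovery path back to idempotency.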

Fourth: Observability. Should this error trigger an alert? What level?

Not every error is worth waking someone up. "User submitted an invalid parameter" shouldn't alert — return a 400 and move on. "Downstream timeout rate jumped from 0.1% to 5%" should alert, but a single error log can't show this — it requires aggregation. "Dead letter queue backlog exceeds 1,000" should alert, but if you don't have a dead letter queue, this metric doesn't exist.

Alerting strategy must be designed at the system level: which errors to count, which counts need thresholds, which thresholds trigger what level of notification. Relying on developers to write log.Error in code and hoping ops will spot anomalies in logs — that's not strategy, that's gambling.
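To show why rate-based alerts need aggregation rather than individual log lines, here is a minimal sketch of a counter for one aggregation window. RateWindow, Level, and the threshold parameters are hypothetical; production systems would normally do this in a metrics pipeline rather than in application code.

```go
package main

// RateWindow counts outcomes within one aggregation window
// (for example, one minute of calls to a single downstream).
type RateWindow struct {
	Total  int
	Failed int
}

// Record adds one call outcome to the window.
func (w *RateWindow) Record(failed bool) {
	w.Total++
	if failed {
		w.Failed++
	}
}

// Level returns "none", "warn", or "page" based on the failure rate.
// Windows with fewer than minSamples calls never alert, so a single
// failure out of two calls does not wake anyone up.
func (w *RateWindow) Level(minSamples int, warnRate, pageRate float64) string {
	if w.Total < minSamples {
		return "none"
	}
	rate := float64(w.Failed) / float64(w.Total)
	switch {
	case rate >= pageRate:
		return "page"
	case rate >= warnRate:
		return "warn"
	default:
		return "none"
	}
}
```

The thresholds themselves (here passed as parameters) are the system-level decision; the code only makes that decision explicit.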

Retry Isn't "Try Again"

Thinking of retry as "call it again if it fails" is how retry goes wrong. Retry is the operation most likely to amplify failures in a distributed system.

First problem: backoff strategy. Retrying immediately after failure means the downstream is probably still struggling. Three retries with zero-second intervals means hitting the downstream with 4 requests (the original plus three retries) within about 100ms. If the downstream was timing out due to overload, these 4 requests make it slower. Exponential backoff with jitter (first wait 100ms, then 200ms, then 400ms, each with a random offset) isn't about elegance; it's about not kicking a downstream that's already down.

Second problem: retry storms. Upstream retries, upstream's upstream retries, the user's browser retries. A brief downstream hiccup, with every layer retrying, multiplies request volume: if three layers each allow three retries, one user request can become up to 4 × 4 × 4 = 64 downstream calls. A hiccup becomes a cascade. This is why retry should happen at one deliberately chosen layer in the chain, not every layer. If the gateway already retries, your service layer shouldn't retry the same call as well; that's stacking.

Third problem: retry requires idempotency. An operation without idempotency guarantees plus retry equals a production bug recipe. You retry a payment API; the user gets charged twice — "I thought the first one didn't succeed" isn't an excuse. Idempotency isn't implemented at retry time — it must be designed in when the operation is defined. Where does the idempotency key come from? What's its validity period? Does the downstream accept the same key? These are interface contract concerns, not code details.
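What "the downstream accepts the same key" means can be sketched with an in-memory stand-in for a payment backend. Charger, Charge, and the charge ID format are hypothetical; a real implementation would persist keys, expire them after the agreed validity period, and check that a repeated request carries the same parameters.

```go
package main

import (
	"strconv"
	"sync"
)

// Charger is a hypothetical in-memory payment backend that deduplicates by key.
type Charger struct {
	mu      sync.Mutex
	results map[string]string // idempotency key -> charge ID
	charges int               // number of charges actually applied
}

func NewCharger() *Charger {
	return &Charger{results: make(map[string]string)}
}

// Charge applies the charge at most once per idempotency key.
// A repeated key returns the stored result instead of charging again.
// This sketch ignores amountCents on repeats; a real service would
// reject a repeated key whose parameters differ.
func (c *Charger) Charge(key string, amountCents int64) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if id, ok := c.results[key]; ok {
		return id
	}
	c.charges++
	id := "ch_" + strconv.Itoa(c.charges)
	c.results[key] = id
	return id
}

// Applied reports how many charges were actually made.
func (c *Charger) Applied() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.charges
}
```

A caller that times out and retries with the same key gets the original charge ID back, and Applied stays at 1.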

Timeouts and Circuit Breakers

There's a class of bug called "slow, not broken."

The downstream doesn't return an error — it just returns slowly. Your service waits, connection pool gradually draining. Upstream is also waiting for you, also draining its pool. An entire chain waits on the slowest downstream. Nobody errors, but the system is already unavailable.

Cross-process calls should have explicit timeouts. Set a timeout when you call downstream, and enforce a deadline on the requests upstream sends to you. Each layer should have a clear answer: "how long do I wait before waiting isn't worth it." This duration shouldn't be a number a developer fills in casually; it should be based on this interface's P99 latency, the business-acceptable response time, and this failure's impact on the entire chain. A historical-order-list query endpoint might set a 2-second timeout; a payment callback might set 5 seconds. Either way there should be a reason, not a guessed number.

Set the timeouts, and there's still another problem: if the downstream is definitively unavailable (persistently high timeout rate), should you keep sending requests?

That's circuit breaker logic. When failure rate exceeds a threshold, the breaker opens — fail fast. Don't waste downstream resources, don't waste upstream wait time. Circuit breaking isn't punishing the downstream. It's self-protection: "I know you're not okay right now; I'll stop hitting you; recover at your own pace; I'll use my fallback." After a cooldown, the breaker probes with a test request — half-open state. If the probe succeeds, full traffic resumes; if it still fails, stay open.

These things sound like middleware configuration — add a library, set parameters. But which interfaces need circuit breakers? What thresholds? What's the fallback after breaking — cached data? Degraded alternate path? Direct error? These aren't pure ops configuration — they're business-level architectural choices. A payment endpoint's circuit breaker strategy and a product recommendation endpoint's strategy can be completely different — the former may not want the same fail-fast approach, because the user being unable to pay is worse than the system being slow.

Compensation and Saga

Not all operations can roll back.

Sent emails can't be unsent. Deducted points can't be "undone" — you can only add a new points record to compensate. A third-party API called has no rollback interface — their system isn't yours to design.

In these scenarios, the "transaction rollback" instinct doesn't apply. What's needed is compensation: an independent business operation whose effect cancels out the original operation's effect in business terms. A refund is payment's compensation, not payment's "rollback." A return is shipping's compensation. Re-crediting points is deduction's compensation.

Compensation has three easily overlooked properties.

Compensation itself can fail. Points deducted successfully; re-crediting fails because the points service is down — compensation needs its own retry and idempotency. A compensation operation isn't a simple inverse function; it has its own independent failure modes, timeout settings, and alert conditions.

Compensation may be "eventual," not "immediate." An email sent can't be recalled — you can only send a correction email. The user may have already seen the first one — this is compensation's cost. When designing an operation, if you know its compensation doesn't take effect immediately, you know the user-visible layer needs additional handling: status messaging, notification rhythm, customer service talking points.

Compensation may trigger new business rules. A return isn't a physical "reverse shipment" — it's an independent business process: inspection, refund, warehousing, financial processing. Its complexity is on par with the forward flow. Treating it as "call this on error" severely underestimates its design cost.

Saga is the pattern for orchestrating these compensations: a long transaction decomposed into multiple local transactions, each paired with a compensating transaction. If one step fails, the compensations for the previously successful steps execute in reverse order.

But Saga introduces a new problem: during compensation, the system's state is externally visible. A user might see an order go from "paid" to "refunding" — intermediate states are exposed. This is Saga's inherent cost, not an implementation flaw. Choosing Saga means accepting externally visible intermediate states and designing for that visibility: what the user sees, when, and what they can do about it.

The Value of Dead Letter Queues

Retried N times, still failing — now what?

Drop it? No — you don't know if this message is important. Manual investigation? Fine — but that depends on someone seeing the logs in time, and the logs haven't been rotated away. Return an error to the caller? If it's async, the caller may have already discarded the request.

A dead letter queue acknowledges something: some failures require human judgment. Not all problems can be handled purely in code. A well-designed dead letter queue turns "we can't handle this" into "we set it aside for a human to handle" — this is an architectural decision, not an admission of defeat.

A dead letter queue's value goes beyond "storing." A good DLQ should answer: how many times this message failed, what each failure's error was, what state it's in now (pending / manually replayed / confirmed discarded). Then build operations processes around these capabilities: who monitors the DLQ? What backlog level triggers alerts? What's the manual replay SOP?

In systems I've seen, many lack dead letter queues not because they're unnecessary, but because nobody considered "handling failed messages" as something needing design. Messages fail and fail; retry a few rounds and move on. When you ask "where do these failed messages go?" the answer is "don't know" — that itself is the problem.

How to Design a System's Error Handling

Having covered all this, back to the most practical question: if you're starting to design a new system, or inheriting one without systematic error handling, where do you begin?

The starting point isn't code. It's answering several questions in the design document for each critical interface:

If this interface's downstream times out — what then? Retry or fail fast? How many retries? What's the idempotency key?
If the downstream returns a business error (not a system error) — what then? Which business errors should be returned to the caller as-is, which should be translated?
If the DB write succeeds but message publishing fails — what then? Is there a compensation path? If not, what's the window during which data inconsistency is observable?
Does this interface's failure need alerting? What level? Alert on individual failures or on success-rate-below-threshold?

Not every interface needs to answer all questions. A config-reading interface and a payment-deduction interface have completely different granularity requirements. Monolith-internal function calls don't need circuit breakers — these strategies kick in at cross-process, cross-service boundaries. But for critical interfaces — payment, ordering, refunds, shipping, points changes — if nobody asked "what happens when this fails?" during the design phase, the design isn't incomplete — it hasn't begun.

When I review a new system's design document and see only sequence diagrams showing the happy path, I get uneasy. Not because sequence diagrams are bad — but because they only show what the system looks like when everything works. A system's reliability isn't determined by how fast and smoothly it runs the happy path — it's determined by whether its behavior under failure remains within control and understanding.

Error Handling Is Design Quality, Not Code Quality

My first few years in the industry, I thought error handling was code detail — part of "code standards." Variable naming, function length, error handling — all code review checklist items. Later I stopped thinking that way.

Not because standards don't matter. Because if you don't treat failure as a first-class citizen during design, code review can't save you. A reviewer can at best tell you "you should check for nil here" or "this error shouldn't be swallowed here." They can't, in the time it takes to review one PR, weave a dozen scattered error handling decisions into a designed whole. That's unrealistic.

Good error handling isn't measured by how many if err != nil blocks you wrote. It's measured by walking into a design review and being able to whiteboard: every step on this chain — its timeout, where to retry, where to circuit-break, where to compensate, where failed messages go, what triggers a page — and justify each decision.

The happy path is the entrance exam. The failure path is the architecture design exam. Treat error handling as an architectural decision to design, not a code annotation to write.
