Real-World Error Handling in Distributed Systems
Source: Dev.to
The Problem with Traditional Exception Handling
On a single machine, throwing an exception feels reasonable. The stack trace is there, the debugger catches it, and you fix the issue. In distributed systems, that mental model breaks almost immediately.
- Once a request crosses process boundaries, context starts disappearing. By the time an exception reaches an API gateway or message broker, the original cause is often gone.
- Async calls, background queues, and retries further blur the picture, leaving you with a failure that is technically visible but practically useless.
Retries Can Be Dangerous
- Retrying blindly can turn a small transient issue into a cascading failure.
- A short database hiccup suddenly becomes a flood of repeated requests that overload the system even further.
Cloud Platforms Add Their Own Complications
- Background jobs, serverless functions, and orchestrators frequently report success while quietly logging errors somewhere nobody is watching.
- From the platform’s point of view, the job completed. From your point of view, critical logic never ran.
The Final Casualty: The User
- Users see a vague error message.
- Support teams cannot trace what happened.
- Engineers are left digging through logs late at night with no clear starting point.
HTTP Status Codes Aren’t Enough
Status codes tell you something went wrong, but not what or why. Clients need structured information that can be logged, displayed, and correlated across services.
A Real‑World .NET Example
public IActionResult GetUser(int id)
{
    var user = _userService.GetUser(id);
    if (user == null)
    {
        return NotFound(new ErrorResponse
        {
            Code = "USER_NOT_FOUND",
            Message = $"User with id {id} does not exist.",
            CorrelationId = HttpContext.TraceIdentifier
        });
    }
    return Ok(user);
}
- Error codes allow front‑ends and other services to react consistently.
- Clear messages help users and support teams.
- Correlation IDs make it possible to trace a failure across logs, queues, and downstream calls.
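To show the consuming side, here is a minimal TypeScript sketch of a client reacting to that error shape. The field names mirror the hypothetical ErrorResponse above and are assumptions, not a fixed contract:

```typescript
// Client-side mirror of the hypothetical ErrorResponse from the .NET
// example above; adjust field names to your actual error contract.
interface ErrorResponse {
  code: string;
  message: string;
  correlationId: string;
}

// Map machine-readable codes to user-facing behavior in one place,
// so every screen reacts to USER_NOT_FOUND the same way.
function describeError(err: ErrorResponse): string {
  switch (err.code) {
    case "USER_NOT_FOUND":
      return "We couldn't find that user. Check the id and try again.";
    default:
      // Keep the correlation id visible so support can trace the failure.
      return `Something went wrong (ref: ${err.correlationId}).`;
  }
}
```

Centralizing the code-to-message mapping is what makes the error codes pay off: front-ends react consistently, and every unknown failure still carries its correlation id.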
Idempotency & Duplicate Handling
In distributed systems, retries are unavoidable. Network calls fail. Timeouts happen. Messages get re‑delivered. If your system cannot safely handle duplicate requests, you will eventually see data corruption or duplicated side effects.
- APIs – Require an Idempotency-Key header. The backend must check and store that key so repeated requests do not re-run the same operation.
- Background jobs / message consumers – Store processed message identifiers in a fast store (Redis, DynamoDB, or a DB table with a unique constraint) to prevent duplicate processing. Skipping this step is how you end up charging customers twice or sending duplicate emails in production.
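The contract both cases share can be sketched in a few lines of TypeScript. An in-memory Map stands in for Redis, DynamoDB, or a unique-constrained table; all names here are illustrative:

```typescript
// Minimal idempotency sketch. The Map stands in for Redis/DynamoDB or a
// DB table with a unique constraint; in production the check-and-store
// must be atomic (e.g. SET NX or a conditional write) to survive races.
const processedKeys = new Map<string, unknown>();

// Runs the operation once per idempotency key and replays the stored
// result for duplicates, so retries never repeat side effects.
async function runIdempotent<T>(
  key: string,
  operation: () => Promise<T>
): Promise<T> {
  if (processedKeys.has(key)) {
    return processedKeys.get(key) as T; // duplicate: replay stored result
  }
  const result = await operation();
  processedKeys.set(key, result);
  return result;
}
```

The important design choice is that duplicates get the original response back, not an error: the client that retried cannot tell the difference, which is exactly what you want.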
Logging, Structured Logs & Alerts
- Early in my career I avoided logging too much because it felt noisy. In distributed systems, under‑logging is a far bigger problem than over‑logging.
- Structured logs (JSON, key‑value pairs) let you query by correlation ID, user ID, or operation name. Including context such as environment, request identifiers, and key input values turns logs into a diagnostic tool rather than a last resort.
- Alerts matter just as much as logs. Logging an error that nobody sees is equivalent to ignoring it. Alert on patterns such as repeated failures, growing queue backlogs, or unusual spikes to react before users notice.
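A structured log entry is just a JSON object with the context baked in. Here is a small TypeScript sketch; the field names are illustrative, not a standard schema:

```typescript
// Sketch of a structured logger: one JSON object per line so log
// backends can index and query every field. Field names are illustrative.
type LogContext = Record<string, string | number | boolean>;

function logError(message: string, context: LogContext): string {
  const entry = JSON.stringify({
    level: "error",
    timestamp: new Date().toISOString(),
    message,
    ...context, // correlationId, userId, operation, environment, ...
  });
  console.error(entry);
  return entry; // returned so the shape can be inspected in tests
}
```

Because every entry carries the same keys, "find everything for correlation id abc-123" becomes a single query instead of an evening of grepping.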
Practical Recommendations
- Be explicit – Throw exceptions only when something truly fails.
- Configure retries with back‑off at the orchestrator level.
- Send critical errors to a shared alerting channel where humans will actually see them.
- For batch jobs / background processing, write failures to a dedicated table or queue. This creates a paper trail that can be inspected, replayed, or manually resolved.
- Frontend error handling should be intentional:
- Use React error boundaries to catch unexpected failures.
- Surface backend error messages carefully—no internal stack traces, but enough detail to be actionable.
- Make retry behavior explicit so users know whether trying again makes sense or if support is needed.
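The retry-with-back-off recommendation above can be sketched as follows. This is a minimal TypeScript version with full jitter; the parameter names and defaults are assumptions, and a real orchestrator would make this policy configuration rather than code:

```typescript
// Retry with exponential back-off and full jitter. Jitter spreads retries
// out so a brief outage does not turn into a synchronized retry storm.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts - 1) break; // out of attempts
      // Delay grows 2^attempt, randomized between 0 and the full window.
      const delay = Math.random() * baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Only retry operations that are idempotent or transient by nature; combined with the idempotency keys above, this is what makes retries safe rather than dangerous.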
Partial Failures & Sagas
Partial failures are unavoidable in distributed workflows. In saga‑style processes, one service can succeed while another fails. Rollbacks are often impossible, so compensating actions and clear logging become essential.
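A compensation-based saga can be sketched as a list of steps, each pairing a forward action with its compensating action. This TypeScript version is a simplification with illustrative names, not any particular framework's API:

```typescript
// Saga sketch: on failure, compensate completed steps in reverse order.
// Compensations are business actions (refund, cancel), not database undo.
interface SagaStep {
  name: string;
  action: () => Promise<void>;
  compensate: () => Promise<void>;
}

async function runSaga(steps: SagaStep[]): Promise<string[]> {
  const completed: SagaStep[] = [];
  const log: string[] = []; // the clear audit trail the text calls for
  for (const step of steps) {
    try {
      await step.action();
      completed.push(step);
      log.push(`done:${step.name}`);
    } catch {
      log.push(`failed:${step.name}`);
      for (const done of completed.reverse()) {
        await done.compensate();
        log.push(`compensated:${done.name}`);
      }
      break;
    }
  }
  return log;
}
```

Note that the log records every transition: when a compensation itself fails in production, that trail is often the only way to finish the cleanup by hand.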
Environment Drift
Environment drift is another silent killer.
- Development, staging, and production often behave differently due to configuration mismatches. Testing error scenarios across environments is tedious but necessary.
- AI integrations introduce their own risks. Large language models can time‑out, return malformed responses, or behave unpredictably. Wrapping these calls with timeouts, circuit breakers, and strict response validation prevents them from becoming a new source of instability.
- Define a shared error contract across services from the beginning. Retrofitting this later is painful and error‑prone.
- Treat every network call as a potential failure, even internal ones. Assuming reliability is how systems fail unexpectedly.
- Invest in log correlation and searchability before the first production incident. It is much harder to add observability after users are already affected.
- Design error responses with clear codes, messages, and correlation identifiers instead of relying on raw exceptions.
- Make all side‑effecting APIs and background jobs idempotent or accept that duplicate processing will happen.
- Log errors with context, not just stack traces, and alert on meaningful patterns.
- Assume cloud platforms will hide failures unless you make them visible.
- In React applications, surface errors honestly and clearly instead of masking them behind generic messages.
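Wrapping an unreliable call, such as an LLM request, with a timeout and strict response validation can be sketched like this. A full circuit breaker would also track failure rates and open after repeated trips; this TypeScript sketch shows only the timeout-and-validate half, with illustrative names:

```typescript
// Timeout + strict validation wrapper for unreliable calls (e.g. LLMs).
// Rejects if the call is too slow or the payload fails validation,
// instead of letting a hung or malformed response poison the caller.
async function callWithTimeout<T>(
  operation: () => Promise<T>,
  timeoutMs: number,
  validate: (value: T) => boolean
): Promise<T> {
  let timer: ReturnType<typeof setTimeout>;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), timeoutMs);
  });
  try {
    const result = await Promise.race([operation(), timeout]);
    if (!validate(result)) {
      throw new Error("malformed response"); // never trust the payload shape
    }
    return result;
  } finally {
    clearTimeout(timer!); // avoid leaking the pending timer
  }
}
```

Treating the validator as mandatory is the point: an LLM response that parses but has the wrong shape should fail loudly here, not three services downstream.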
Discussion Prompt
How do you handle partial failures and retries in your own distributed systems?
What patterns saved you during incidents, and what approaches failed under pressure?
I would love to hear your war stories or disagreements.
Offer to Share Examples
If you want a C# error‑response template or a concrete idempotency example, let me know and I can share what has worked for me.