Most Production Bugs Don’t Live in Your Code. They Live Between Systems
Source: Dev.to
Why the hardest problems live between systems
Everyone loves talking about writing code: clean code, fast code, AI-generated code, boilerplate-free code.
But in many real systems the hardest problems don’t live in the controller, stored procedure, or API method you wrote last week. They show up between systems.
A request can be valid in System A, transformed in middleware, partially enriched in System B, silently rejected in System C, and then reported back to the business as “completed” because a status flag looked fine at the wrong layer. Failures like this are painful to explain because cause and effect sit in different components, so no single system shows the whole story.
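The scenario above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Hop` record, the system names, and the `end_to_end_status` helper are hypothetical and do not come from any real integration platform. The sketch shows how a flag read at one layer can say "completed" while the request was rejected further downstream.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hop:
    """One system's view of a request (hypothetical record, for illustration)."""
    system: str
    outcome: str  # "ok", "rejected", or "ignored"


def end_to_end_status(hops: List[Hop]) -> str:
    """Return "completed" only if every hop succeeded; otherwise name the first failing hop."""
    for hop in hops:
        if hop.outcome != "ok":
            return f"failed at {hop.system} ({hop.outcome})"
    return "completed"


# Hypothetical path of one request through four systems.
hops = [
    Hop("system-a", "ok"),
    Hop("middleware", "ok"),
    Hop("system-b", "ok"),
    Hop("system-c", "rejected"),
]

# Reading the status at the wrong layer (here, the middleware) looks fine.
flag_at_middleware = "completed" if hops[1].outcome == "ok" else "failed"

print(flag_at_middleware)       # what the business is told
print(end_to_end_status(hops))  # what actually happened
```

The design point is that the overall status is derived from every hop, not from whichever layer happens to set the flag.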
Observability and distributed tracing
OpenTelemetry’s observability docs describe distributed tracing as a way to observe requests as they propagate through complex distributed systems, helping debug behavior that is difficult to reproduce locally.
The same idea appears in enterprise integration platforms. SAP’s Cloud Integration documentation notes that the Message Processing Log stores data about processed messages and individual processing steps, and the message monitor lets you inspect individual messages on a tenant. The point is the same in both cases: you need to be able to follow a message step by step.
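A step-by-step view of one message can be approximated even without a tracing backend, as long as every system logs a shared identifier. The sketch below is a minimal example, assuming each log event carries a `correlation_id`, an ISO 8601 timestamp `ts`, a `system`, and a `step`; these field names and the sample events are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical log events collected from several systems, in arrival order.
events = [
    {"correlation_id": "req-1", "ts": "2024-01-01T10:00:02+00:00", "system": "system-b", "step": "enrich"},
    {"correlation_id": "req-2", "ts": "2024-01-01T10:00:01+00:00", "system": "system-a", "step": "validate"},
    {"correlation_id": "req-1", "ts": "2024-01-01T10:00:00+00:00", "system": "system-a", "step": "validate"},
    {"correlation_id": "req-1", "ts": "2024-01-01T10:00:01+00:00", "system": "middleware", "step": "transform"},
]


def timelines(events):
    """Group events by correlation ID and order each group by timestamp."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["correlation_id"]].append(event)
    for steps in grouped.values():
        steps.sort(key=lambda e: datetime.fromisoformat(e["ts"]))
    return dict(grouped)


# Print the reconstructed path of one request across systems.
for step in timelines(events)["req-1"]:
    print(step["ts"], step["system"], step["step"])
```

Ordering by timestamp assumes the clocks of the participating systems are reasonably in sync; when they are not, a per-hop sequence number or parent span ID is a safer ordering key.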
Key questions to ask during an incident
When a failure occurs, consider:
- What actually happened?
- Which system is the source of truth?
- Was the payload wrong, or was the transformation wrong?
- Did the receiving system reject it, ignore it, or accept it and fail later?
- Are we looking at a business‑process failure or just a misleading status?
Answering these questions requires a different mindset: stop thinking like a builder for a minute and start thinking like an investigator.
Essential data for tracing failures
- Timestamps
- Correlation IDs
- Payload versions
- Processing steps
- Retry history
- Side effects
- Context propagation
OpenTelemetry describes context propagation as the mechanism that allows signals like traces, metrics, and logs to be correlated across distributed boundaries. If your systems cannot carry context forward, debugging becomes much harder.
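To make "carrying context forward" concrete, here is a hand-rolled sketch of passing a trace identifier from one hop to the next using the W3C Trace Context `traceparent` header format (`version-traceid-parentid-flags`). The function names are hypothetical, and in a real service an OpenTelemetry SDK would normally handle this for you; the sketch only shows the idea.

```python
import re
import secrets
from typing import Optional

# version "00", 32-hex trace ID, 16-hex parent (span) ID, 2-hex flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace: random trace ID and span ID, with the sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming: Optional[str]) -> str:
    """Keep the caller's trace ID but issue a new span ID for this hop.

    Falls back to a new trace when the header is missing or malformed.
    """
    match = TRACEPARENT.match(incoming or "")
    if match is None:
        return new_traceparent()
    trace_id, _parent_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# System A starts the trace; the middleware forwards it to System B.
incoming_headers = {"traceparent": new_traceparent()}
outgoing_headers = {"traceparent": child_traceparent(incoming_headers.get("traceparent"))}

same_trace = (
    incoming_headers["traceparent"].split("-")[1]
    == outgoing_headers["traceparent"].split("-")[1]
)
print(same_trace)  # True: both hops share one trace ID
```

If any hop drops the header, the next system starts a fresh trace and the chain breaks at exactly that boundary, which is why missing propagation makes cross-system debugging so much harder.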
Why “works on my machine” is insufficient
Locally you usually don’t have:
- Asynchronous retries
- Middleware transformations
- Environment‑specific credentials
- Downstream validation rules
- Race conditions between systems
- Stale reference data
- A workflow engine making a different decision than expected
By the time a failure reaches production, it is often an observability, systems‑thinking, or handoff problem rather than a coding problem.
Takeaway
Senior engineering is not just about building systems. It is about explaining how they fail.
Engineers who can trace failures across boundaries become the go-to people for the hardest incidents, not because they write the fanciest code, but because they know how to follow the truth from one system to the next.