How backend production systems actually fail

Published: February 22, 2026, 08:20 AM EST
8 min read
Source: Dev.to

Introduction

Production systems experience incidents; some more than others. Most of the time, when something goes wrong in production, the code is doing exactly what it was written to do. The problem is that production introduces conditions that cannot be fully simulated ahead of time. In this article, I will look at how these failures actually happen, group them into three patterns, explain why each pattern is dangerous, and share the lessons that can be learned.

Production systems don’t fail because code is bad; they fail because reality isn’t always consistent.

Prerequisites

Before I proceed, please note that this article is for:

  • Backend Engineers
  • People running production systems
  • Anyone who has dashboards that say “green” while users complain

Failure Patterns

Failure Pattern #1: Cascading Failures

Cascading failures occur when one service in a system becomes slow or fails, which in turn affects how other parts of the system that depend on the service behave. Cascading failures can even arise from small user actions, for example, retries.

Example

I once worked on a project where a cascading failure occurred. Certain DB queries created bottlenecks due to their complexity and the growing size of the data. To make matters worse, the connection pool had reached its maximum number of slots, so further database calls could not be processed. This resulted in two likely scenarios:

  1. The user’s request was abruptly cancelled and they tried again.
  2. The request lingered in the system, waiting for an open connection to execute the query.

The second option became a cascading failure: the system tried to process more than it should at a given time while also handling regular incoming requests. Wait times grew; even simple tasks like logging in took a long time. In some cases, the CPU maxed out and the entire system became unresponsive.

What happens in the background?

Each request runs on a thread for its lifetime. The thread pool (i.e., the allocation of threads for processes) is limited; when requests pile up, they consume the available threads, leaving new requests stuck in a waiting state. What actually gets exhausted is rarely “the server” itself—it’s thread pools, DB connections, or queue workers.
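A minimal sketch of this effect, using Python's thread pool. The request handler and timings here are invented for illustration; the point is that with a fixed number of workers, later requests spend most of their lifetime just waiting for a free thread:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Illustrative only: a tiny "server" with 2 worker threads.
pool = ThreadPoolExecutor(max_workers=2)

def handle_request(request_id: int) -> int:
    time.sleep(0.2)  # simulates a slow DB query holding the thread
    return request_id

# Submit 6 requests; only 2 run at a time, the rest sit in the queue.
start = time.monotonic()
futures = [pool.submit(handle_request, i) for i in range(6)]
results = [f.result() for f in futures]
elapsed = time.monotonic() - start

# With 2 workers and six 0.2s tasks, total time is roughly 0.6s:
# the later requests spent most of that time waiting, not working.
print(results)        # [0, 1, 2, 3, 4, 5]
print(elapsed > 0.5)  # True
```

The same arithmetic applies to DB connection pools and queue workers: the pool size, not the machine, is the resource that runs out first.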

Why is this dangerous?

  • Slowness is contagious. A slow component or service will affect the delivery of other services, presenting a broken system to users.
  • Partial health illusion. A system can look healthy in isolation while failing as a whole due to cascading failures. Depending on the design, some services may continue to operate, but their dependence on the affected service causes the entire system to fail.

Lessons learned

  1. Timeouts – Ensure long‑running requests or batch jobs are terminated after a reasonable period. Timeouts can be applied to known bottlenecks, such as external provider calls or heavy database queries.
  2. Circuit breakers – Route traffic away from failing services or dependencies to healthy alternatives. For example, if a third‑party payment provider is down, a circuit breaker can switch to another provider until the primary one recovers.
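The circuit-breaker idea can be sketched in a few lines. This is an illustrative in-process breaker, not a production library; the class name, thresholds, and behavior are my own simplification (open after N consecutive failures, fail fast during a cool-down, then let one probe call through):

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: opens after N consecutive failures,
    then rejects calls for a cool-down period. Illustrative only."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Half-open: cool-down elapsed, let one probe call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

breaker = CircuitBreaker(max_failures=3, reset_after=60)

def flaky_provider():
    raise ConnectionError("payment provider down")

# After three consecutive failures the breaker opens; the fourth call
# fails fast instead of hammering the already-struggling dependency.
for _ in range(3):
    try:
        breaker.call(flaky_provider)
    except ConnectionError:
        pass
```

Failing fast is the point: instead of holding a thread while a dead dependency times out, the caller gets an immediate error it can route around.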

Failure Pattern #2: Partial Failures

Partial failures occur when only a part of a system fails while other parts continue to function, leading to incomplete or inconsistent results. They are subtle but can be very expensive.

Example

I worked on a payment system where users could initiate charges. One user attempted to charge their card but received no response in time, so they retried. The payment service experienced a brief downtime and could not fully process the request, but it still accepted incoming requests into a queue. When the service recovered, it processed each request as a unique transaction, unaware that the second request was a retry, resulting in a double charge.

From the user’s perspective, retrying is reasonable. From the system’s perspective, each request looked unique, so duplicates were created. Nothing was technically a bug; every step made sense in isolation.

Why is this dangerous?

Partial failures put systems in an “in‑between” state:

  • User view: “It didn’t work.”
  • System view: “Part of it did work.”

This creates a divergent truth where some operations succeed, others fail, and the system and user disagree about the outcome.

Lessons learned

  • Idempotency – Design operations to be idempotent so that retries do not cause side‑effects such as double charges.
  • Transactional guarantees – Use atomic transactions or two‑phase commits where appropriate to ensure that either all steps succeed or none do.
  • Visibility & monitoring – Surface partial‑failure states in dashboards and alerts so operators can act before inconsistencies grow.
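As a concrete sketch of the transactional guarantee, here is an all-or-nothing transfer using SQLite. The schema and function are invented for this example; the relevant behavior is that `with conn:` opens a transaction that commits on success and rolls back on any exception, so a half-finished transfer never persists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src: str, dst: str, amount: int) -> bool:
    try:
        with conn:  # transaction: commits on success, rolls back on error
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?",
                (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
        return True
    except ValueError:
        return False

transfer(conn, "alice", "bob", 60)  # succeeds
transfer(conn, "alice", "bob", 60)  # fails; the debit is rolled back
print(dict(conn.execute(
    "SELECT name, balance FROM accounts ORDER BY name")))
# {'alice': 40, 'bob': 60}
```

Without the transaction, the second call would have left alice debited with no matching credit, exactly the "part of it did work" state described above.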

Duplicate Requests

Duplicate requests are unavoidable; users refresh pages, client apps resend requests, or other systems retry automatically. The backend must assume the scenario:

“this request might be sent more than once”

and handle it properly. Systems can use request identifiers to achieve idempotency, allowing them to treat retries from the same source as the same request and avoid duplicates by either:

  • responding with the result of the first request, or
  • discarding the earlier request and processing the latest one (which strategy fits depends on the system).
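The first strategy (respond with the result of the first request) can be sketched with an in-memory store. In a real system the seen-store would be Redis or a database table with an expiry; the function and key names here are illustrative:

```python
# Maps idempotency key -> result of the first successful processing.
processed: dict = {}

def charge(idempotency_key: str, amount: int) -> dict:
    # Retry with the same key? Return the original result, do not re-charge.
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = {"status": "charged", "amount": amount}  # pretend we hit the PSP
    processed[idempotency_key] = result
    return result

first = charge("req-123", 500)
retry = charge("req-123", 500)  # client retried the same request
print(retry is first)            # True: same result object, no double charge
print(len(processed))            # 1
```

The key must come from the client (or a stable derivation of the request), because only the client knows that two requests are "the same" attempt.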

Failure Pattern #3: Silent Failures

Silent failures are by far the deadliest because they are the hardest to notice. A background job could fail quietly, or a report may not generate. At first glance, everything seems fine until someone notices a mismatch days later.

Silent failures don’t necessarily mean the system crashed; they usually occur when any of the following happen:

  • The system continues operating
  • Requests appear successful
  • No alerts are fired
  • Dashboards look “fine”

…but the business outcome is wrong.

Essentially, silent failures are failure modes where error signals do not propagate to the layer that observes correctness. An operation as simple as a cache write failing or an event being published but never consumed can be an indicator of a silent failure.
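A tiny sketch of how such a failure arises: the error is caught at a layer that never reports it, so the request "succeeds" while the business outcome is wrong. The cache class and handler here are invented stand-ins:

```python
class FlakyCache:
    """Invented stand-in for a cache whose writes can fail."""
    def __init__(self):
        self.store = {}
        self.broken = True  # simulate an unreachable cache node

    def set(self, key, value):
        if self.broken:
            raise ConnectionError("cache node unreachable")
        self.store[key] = value

cache = FlakyCache()

def handle_order(order_id: str) -> dict:
    order = {"id": order_id, "status": "created"}
    try:
        cache.set(order_id, order)  # fails...
    except ConnectionError:
        pass  # ...but the error is swallowed: no log, no metric, no alert
    return {"ok": True}             # the request still "succeeds"

response = handle_order("order-42")
print(response)                   # {'ok': True} - looks fine
print("order-42" in cache.store)  # False - the outcome is silently wrong
```

Every signal the system emits says success; only a later consistency check (or an unhappy customer) reveals the gap.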

Why is this dangerous?

For silent failures, users of a system may never notice immediately, and teams assume everything is fine. This leads to accumulated problems until the business is impacted—for example:

  • Orders without payments
  • Payments without invoices being sent out
  • Events not being consumed

The backend then carries “historical corruption”. In many cases, fixing the bug doesn’t fix the damage because new data is correct, but old data remains wrong. Teams must employ techniques such as:

  • Back‑filling data
  • Event re‑processing
  • One‑off migration scripts

Lessons Learned

  • Observability – Essential to every backend system, but only valuable when done well. Proper observability tells you whether the system is doing the right thing, not just whether it is up.
  • Logging – Should track the entity involved (e.g., orderID, transactionReferenceID), the reason for failure, and the next steps. Good logs enable alerts, trace flows, and faster detection of silent failures.
  • Metrics – Crucial for detecting silent failures. Since silent errors occur when requests succeed, domain metrics such as orders_count_total, events_published_total, completed_payments_total, abandoned_payments_total, etc., are helpful. Use them to assert relationships or raise alerts.
    • Example: raise an alert if abandoned_payments_total exceeds a threshold or if orders_count_total and completed_payments_total diverge significantly.
  • Alerts – Only matter if they’re actionable. An alert that merely says “Error rate increased” is noise; it lacks sufficient information to act upon. Actionable alerts should tell the user what broke, where to look, and why it matters.

In summary: if an alert doesn’t tell you what to do next, it’s noise.
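The metric-relationship idea can be sketched as a simple check. The metric names follow the examples above; the threshold and alert wording are arbitrary choices for illustration:

```python
def check_order_payment_gap(orders_count_total: int,
                            completed_payments_total: int,
                            max_gap_ratio: float = 0.05) -> list:
    """Alert when too many orders lack a completed payment.

    Returns a list of actionable alert strings (empty = healthy).
    """
    alerts = []
    if orders_count_total == 0:
        return alerts
    gap = orders_count_total - completed_payments_total
    if gap / orders_count_total > max_gap_ratio:
        alerts.append(
            f"{gap} of {orders_count_total} orders have no completed payment "
            f"(> {max_gap_ratio:.0%} threshold); check the payment worker queue")
    return alerts

print(check_order_payment_gap(1000, 990))  # within tolerance: no alert
print(check_order_payment_gap(1000, 800))  # one actionable alert
```

Note the alert text says what broke, where to look, and why it matters; a bare "error rate increased" would carry none of that.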

It is also imperative to understand that “working” is not “correct”, especially when dealing with silent failures. While backend systems optimize for availability, throughput, and resilience, they should also optimize for correctness.

Conclusion

Production failures rarely stem from “bad code.” They arise because real‑world conditions—resource limits, network partitions, third‑party outages, and human behavior—cannot be fully reproduced in test environments. Understanding the three failure patterns, why they’re dangerous, and applying the corresponding lessons (timeouts, circuit breakers, idempotency, etc.) can dramatically improve system resilience.

Production failures don’t start when alerts fire—they start when assumptions go unchecked. The goal isn’t zero failure; it’s failure you can see, understand, and recover from. Production systems don’t fail loudly by default; they fail quietly—unless we design them not to.

As engineers, we must:

  1. Take failure patterns into account when building systems.
  2. Accept that production issues are inevitable.
  3. Respond to them in ways that keep our systems robust over time.