Flaky Tests Are Not a Testing Problem. They're a Feedback Loop You Broke.

Published: February 15, 2026 at 10:50 AM EST
4 min read
Source: Dev.to

Every retry rule in your CI pipeline is a painkiller. It suppresses the symptom, the stock of broken code keeps growing underneath, and nobody feels the pain until the whole system is addicted.

I came across a post on Hacker News that perfectly illustrates the pattern: retries everywhere, quarantining tests, adding waits, slowly losing trust in CI signal. The author asked whether flakiness is “a test problem, a product problem, or infrastructure noise.”
It’s none of those. It’s a system‑structure problem. When viewed through the lens of System Dynamics, the diagnosis becomes obvious.

The Reinforcing Loop of Retries

Every “fix” that masks a failure instead of resolving it feeds a reinforcing loop:

```mermaid
flowchart LR
    subgraph Symptom
        RED[RED PIPELINE] -->|Retry| GREEN[GREEN BUILD]
    end
    GREEN -->|Short-Term Relief| BALANCING[Balancing Loop]
    BALANCING -->|Intervention| RETRY[Retry Rule]
    RETRY -->|Long-Term Side-Effect| REINFORCING["Reinforcing Loop (R1) – The Addiction"]
    REINFORCING -->|Delay| MORE[MORE FLAKINESS]
    MORE -->|Accumulate| HIDDEN[HIDDEN BUGS]
    HIDDEN -->|Feedback| RED
```

Bottom loop: each time you hit retry, you feel good because the light turns green, but you are feeding R1 – hidden bugs accumulate, making the system flakier and forcing even more retries tomorrow.
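The loop is easy to see in a toy model. Everything below is illustrative, not calibrated to any real pipeline: the point is only that when retries feed the hidden-bug stock, and the stock drives flakiness, retry counts compound instead of leveling off.

```python
# A toy model of R1, the addiction loop: every retry-masked failure feeds the
# stock of hidden bugs, and a bigger stock makes the next run flakier.
# All coefficients here are invented for illustration.

hidden_bugs = 5.0                 # stock: defects currently masked by retries
weekly_retries = []

for week in range(10):
    flake_rate = min(0.9, 0.02 * hidden_bugs)   # flakiness tracks the hidden stock
    retries = 100 * flake_rate                  # expected retries across 100 CI runs
    hidden_bugs += retries / 10                 # a slice of masked failures hides new bugs
    weekly_retries.append(round(retries))

print(weekly_retries)
# → [10, 12, 14, 17, 21, 25, 30, 36, 43, 52]
# Each retry buys relief today and a flakier pipeline tomorrow.
```

No balancing force ever kicks in, so the series grows roughly 20% a week: that is what a pure reinforcing loop looks like in numbers.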

Shifting the Burden

This is textbook “Shifting the Burden”, one of the classic system archetypes catalogued in Peter Senge’s The Fifth Discipline and echoed in Donella Meadows’ Thinking in Systems. The short‑term fix (retry) actively undermines the long‑term solution (actually fixing the bug).

Drift to Low Performance (R2)

R1 does not run alone; it drags a second reinforcing loop behind it:

```mermaid
flowchart LR
    ACTUAL["Actual Quality (Lots of Red)"] --> PERCEIVED["Perceived Quality (It’s just noise)"]
    PERCEIVED -->|"Reinforcing Loop (R2) – The Erosion"| LOWER[Lower Standards]
    LOWER -->|Less debugging, more merging| ACTUAL
```

Because you don’t trust the CI signal, you lower your standards. Lower standards lead to merging worse code, which makes the signal even less trustworthy. The cycle repeats until the CI pipeline becomes a mere decoration.

Who Should Feel the Pain?

The original Hacker News post asked, “How do QA and engineering teams split responsibility?”
That’s the wrong question. The real question is: how do you make the pain of instability felt by the person who introduced it?
Right now, the infrastructure absorbs the pain by retrying, so developers never feel it. They keep submitting flaky code because the system lets them get away with it.

Case Study: Porting a ROS 2 Desktop Stack

I built a CI pipeline to port an entire ROS 2 Desktop stack onto two RISC‑V Linux distributions that lack official upstream support: openEuler (CentOS‑based) and openKylin (Ubuntu‑based). The project involved:

  • Two different base systems
  • 973 packages
  • Zero upstream CI support

My system went through three phases that map directly to the dynamics above.

v1 – The Brute‑Force Probe

  • Pulled all 973 packages into the pipeline and let them build.
  • Triggered widespread breakages – not a failure, but a data‑mining operation.
  • Successfully built 597 packages (proving feasibility) and identified 214 dependency gaps and 151 build failures.
  • Goal: make every hidden stock of problems visible.
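The probe’s output only becomes a usable stock once every result is classified into exactly one bucket. A minimal sketch of that triage step, with invented log patterns standing in for the real rosdep/compiler output:

```python
# Triage build results into the three stocks v1 surfaced:
# built, dependency_gap, build_failure. Log patterns are illustrative.

def classify(result: dict) -> str:
    if result["exit_code"] == 0:
        return "built"
    if "rosdep" in result["log"] or "No definition of" in result["log"]:
        return "dependency_gap"     # missing system dependency: a packaging problem
    return "build_failure"          # code actually fails to compile on this target

results = [
    {"pkg": "rclcpp", "exit_code": 0, "log": ""},
    {"pkg": "rviz2", "exit_code": 1, "log": "rosdep: No definition of [qt5-base]"},
    {"pkg": "pcl_ros", "exit_code": 2, "log": "error: unknown type name '__m128'"},
]

stocks: dict[str, list[str]] = {}
for r in results:
    stocks.setdefault(classify(r), []).append(r["pkg"])
print(stocks)
# → {'built': ['rclcpp'], 'dependency_gap': ['rviz2'], 'build_failure': ['pcl_ros']}
```

With this split, the 214 dependency gaps and 151 build failures stop being “red noise” and become two distinct, countable stocks with different owners and different fixes.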

v2 – The Verification Engine

  • Used v1’s data to build a system that verifies before building – probing the OS environment to identify dependency gaps before consuming expensive build resources.
  • Build attempts dropped, success rate went up because garbage was no longer fed into the pipeline.
  • Repository:
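v2’s core move can be sketched in a few lines: resolve each declared dependency against the target OS before spending a build slot. The `AVAILABLE` set and the package names below are invented sample data standing in for a real package-manager query.

```python
# Verify-before-build: cheap dependency resolution gates expensive builds.
# AVAILABLE stands in for querying what the target distro actually ships.

AVAILABLE = {"cmake", "python3-numpy", "eigen3"}

def verify(pkg: str, deps: list[str]) -> tuple[bool, list[str]]:
    """Return (buildable, missing) without consuming build resources."""
    missing = [d for d in deps if d not in AVAILABLE]
    return (not missing, missing)

queue = {"message_filters": ["cmake"], "tf2_eigen": ["cmake", "eigen3", "orocos-kdl"]}
for pkg, deps in queue.items():
    ok, missing = verify(pkg, deps)
    print(pkg, "→ build" if ok else f"→ skip (missing {missing})")
# message_filters → build
# tf2_eigen → skip (missing ['orocos-kdl'])
```

This is why build attempts dropped while the success rate rose: packages with known gaps never reach the compiler, so every build that does run has a real chance of succeeding.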

v3 – Incremental Stock Management

  • Instead of tackling everything at once, I identify small batches of problematic dependencies, isolate them into manageable “stocks,” and resolve them one group at a time.
  • Subtraction, not addition.
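The batching discipline itself is trivial to express; the value is in refusing to drain the whole stock at once. A sketch, with an invented gap list and batch size:

```python
# Drain the stock of dependency gaps in small fixed-size batches,
# one batch per cycle, instead of attacking all of them at once.

def batches(gaps: list[str], size: int = 3):
    for i in range(0, len(gaps), size):
        yield gaps[i:i + size]

dependency_gaps = ["qt5-base", "orocos-kdl", "assimp", "ogre", "urdfdom", "sdl2", "bullet"]
for n, batch in enumerate(batches(dependency_gaps), start=1):
    print(f"cycle {n}: resolve {batch}")
# cycle 1: resolve ['qt5-base', 'orocos-kdl', 'assimp']
# cycle 2: resolve ['ogre', 'urdfdom', 'sdl2']
# cycle 3: resolve ['bullet']
```

Each cycle ends with the stock strictly smaller: subtraction, made mechanical.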

The Addiction Within My Own CI

My own CI exhibits the same addiction pattern:

  • Virtual environments bypass system dependency conflicts.
  • Masquerade rules spoof package identities.
  • The architecture diagram in the README shows multiple “intervention” nodes – each one a band‑aid.

I know these are temporary splints, not fixes, and I am consciously tracking the technical debt they create. Most teams don’t: they treat retries as “solutions.” Being aware of the addiction and being consumed by it are two very different things.

Designing a Feedback Loop That Works

  • Identify the stock that’s poisoning the pipeline.
  • Design a feedback loop that makes the right person feel the pain.
  • The organization still needs to care – that’s often the real bottleneck, not the flaky tests themselves.
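One concrete loop design that satisfies the first two points: give every test a small, finite retry budget. While budget remains, retries behave as today; once it is spent, retries stop working and the build goes hard-red until someone pays down the stock. This is a sketch assuming a CI plugin that can veto retries; the budget size is arbitrary.

```python
# Retry budget: retries are a depletable resource per test, not a free pass.
from collections import Counter

RETRY_BUDGET = 3
retry_ledger: Counter[str] = Counter()

def may_retry(test: str) -> bool:
    """Allow a retry only while the test has budget left; spend one unit."""
    if retry_ledger[test] >= RETRY_BUDGET:
        return False          # budget exhausted: flakiness becomes visible pain
    retry_ledger[test] += 1
    return True

for attempt in range(5):
    print(attempt, may_retry("test_login"))
# attempts 0–2 → True, attempts 3–4 → False
```

The budget turns the open-ended reinforcing loop into a bounded one: the painkiller runs out by design, and the red build lands on whoever the organization has decided should feel it.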

If you’re dealing with a similar “build‑first‑verify‑never” problem, the v2 Verification Engine repo shows this systems‑thinking approach applied to a real project. I’m looking for exactly these kinds of challenges.
