I Intentionally Built a Bad Decision System (So You Don’t Have To)
The Task: Same Problem, Two Implementations
Both systems solve the exact same problem:
Input text → extract keywords → compute a score → recommend an action
The action space is deliberately small:
- `WAIT_AND_SEE`
- `BUY_MORE_STOCK`
- `PANIC_REORDER`
Keeping the task simple allows us to focus entirely on system behavior, not model quality.
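To make the shape concrete, here is a minimal sketch of the decision step. The thresholds are illustrative assumptions, not the repository's actual cut-offs (only the panic threshold of 42 is confirmed later in the article):

```python
def decide_action(score: int) -> str:
    # Illustrative thresholds; only 42 appears later in the article
    if score > 42:
        return "PANIC_REORDER"
    if score > 20:  # assumed mid-range cut-off
        return "BUY_MORE_STOCK"
    return "WAIT_AND_SEE"
```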
The Benchmark Idea
The benchmark is intentionally minimal:
- Take a single, fixed input text.
- Run it multiple times through the system.
- Observe whether the outputs stay stable.
Why this matters: a system that only works once is not a system — it’s a coincidence. If the same input produces different outputs, something is fundamentally wrong at the system level.
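In code, the benchmark is little more than a loop. Here is a minimal sketch, assuming the system exposes a `run_pipeline(text)` entry point like the one shown in the patches below:

```python
def benchmark(run_pipeline, text: str, runs: int = 5) -> None:
    # Feed the identical input N times and count distinct outputs
    results = [run_pipeline(text) for _ in range(runs)]
    scores = [r["score"] for r in results]
    actions = [r["action"] for r in results]
    print(f"Runs: {runs}")
    print(f"Unique scores:  {len(set(scores))} -> {scores}")
    print(f"Unique actions: {len(set(actions))} -> {sorted(set(actions))}")
```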
Benchmark Results: BAD vs GOOD
The following results were produced by running the same input five times through both systems.
BAD System Output (excerpt)
The BAD system gradually escalates its decisions:
| Run | Score | Action |
|---|---|---|
| 1 | 14 | WAIT_AND_SEE |
| 3 | 42 | BUY_MORE_STOCK |
| 5 | 74 | PANIC_REORDER |
Same input. Same keywords. Completely different decisions.
Aggregated Benchmark Summary
BAD system
- Runs: 5
- Unique scores: 5 → [14, 28, 42, 58, 74]
- Unique actions: 3
GOOD system
- Runs: 5
- Unique scores: 1 → [14, 14, 14, 14, 14]
- Unique actions: 1
The GOOD system behaves like a pure function. The BAD system behaves like a memory leak.
Failure Taxonomy: How the BAD System Breaks
The bad system does not fail in a single obvious way. Instead, it exhibits multiple interacting failure modes that are common in real‑world AI and data systems. Naming these failure modes makes them easier to detect—and harder to accidentally ship.
1️⃣ Drift
- Definition: The system’s output changes over time even when the input stays exactly the same.
- Root cause: Global score accumulation across runs; state that grows monotonically without reset.
- Why this is dangerous:
- Business logic mutates without any explicit change.
- Historical execution order influences current decisions.
- Monitoring dashboards often miss the problem because values remain “reasonable”.
Drift is especially dangerous because it looks like learning—but it isn’t.
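A stripped-down illustration of the pattern (not the repository's exact code): a module-level accumulator makes each result depend on how many calls came before it.

```python
CURRENT_SCORE = 0  # module-level state shared by every call

def drifting_score(base: int) -> int:
    global CURRENT_SCORE
    CURRENT_SCORE += base  # history leaks into the result
    return CURRENT_SCORE

print([drifting_score(14) for _ in range(5)])  # [14, 28, 42, 56, 70]
```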
2️⃣ Non‑determinism
- Definition: Identical inputs produce different outputs.
- Root cause: Random noise injected into scoring; implicit dependency on execution history.
- Why this is dangerous:
- Bugs cannot be reliably reproduced.
- Test failures become flaky and untrustworthy.
- A/B experiments lose statistical meaning.
If you can’t reproduce a decision, you can’t debug it.
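The entire failure class fits in a few lines (illustrative):

```python
import random

def noisy_score(base: int) -> float:
    # Same input, different output on every call
    return base * random.random()

print(noisy_score(40), noisy_score(40))  # two different numbers
```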
3️⃣ Hidden State
- Definition: Functions rely on data that is not visible in their interface or inputs.
- Root cause: Global variables such as `CURRENT_SCORE`, `LAST_TEXT`, and `RUN_COUNT`.
- Why this is dangerous:
- Code cannot be understood locally.
- Refactoring changes behavior in non‑obvious ways.
- New contributors unknowingly introduce regressions.
Hidden state turns every function call into a guessing game.
4️⃣ Silent Corruption
- Definition: The system continues to run without errors while its decisions become increasingly wrong.
- Root cause: No explicit failure signals; no invariants or sanity checks.
- Why this is dangerous:
- Incorrect outputs propagate downstream.
- Problems surface only through business impact.
- Rollbacks become difficult or impossible.
Loud failures get fixed. Silent failures get deployed.
Why This Taxonomy Matters
These failure modes rarely appear in isolation. In the BAD system, they reinforce each other:
- Hidden state enables drift.
- Drift amplifies non‑determinism.
- Non‑determinism hides silent corruption.
Understanding these patterns is more valuable than fixing any single bug—because the same taxonomy applies to much larger and more complex AI systems.
A Single Metric: Stability Score
To summarize system behavior, I used a single metric:
stability_score = 1 - (unique_scores / runs)
- Higher → more stable. Note the ceiling: a perfectly stable system scores 1 − (1/runs), which is 0.8 for five runs.
- 0.0 → completely unstable: every run produced a new score.
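The metric itself is a one-liner. A minimal implementation (a sketch; the repository's `compare.py` may compute it differently):

```python
def stability_score(scores: list) -> float:
    """1 - unique/total: 0.0 means every run differed."""
    return 1 - len(set(scores)) / len(scores)

print(stability_score([14, 28, 42, 58, 74]))  # 0.0 (BAD)
print(stability_score([14, 14, 14, 14, 14]))  # 0.8 (GOOD)
```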
Stability Results
| System | Stability Score |
|---|---|
| BAD | 0.0 |
| GOOD | 0.8 |
This one number already tells you which system you can trust.
Minimal Fixes: Four Small Patches That Change Everything
This is not a rewrite. These are surgical changes. Each patch removes an entire class of failure modes without introducing new abstractions or frameworks.
Patch 1 — Remove Global State
Before (BAD):
```python
# global mutation + history dependence
GS.CURRENT_SCORE += base
return GS.CURRENT_SCORE
```
After (GOOD):
```python
def score_keywords(keywords, text):
    return sum(len(w) % 7 for w in keywords) + len(text) % 13
```
What this fixes
- Eliminates score drift.
- Removes hidden history dependence.
- Makes the function deterministic and testable.
A function that depends on global state is not a function—it’s a memory leak.
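Because the function is now pure, determinism can be pinned down by a trivial test (a sketch using pytest conventions; the keyword values are made up):

```python
def test_score_keywords_is_deterministic():
    keywords, text = ["urgent", "restock"], "urgent restock needed"
    # A pure function must return the same value for the same arguments
    assert score_keywords(keywords, text) == score_keywords(keywords, text)
```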
Patch 2 — Push Side‑Effects to the Boundaries
Before (BAD):
```python
def extract_keywords(text):
    print("Extracting keywords...")   # hidden console I/O
    open("log.txt", "a").write(text)  # hidden file I/O
    return tokens[:k]
```
After (GOOD):
```python
def extract_keywords(text):
    # Pure computation – no I/O, no printing
    return tokenize(text)[:k]
```
What this fixes
- Removes hidden I/O side‑effects that make runs non‑deterministic.
- Keeps logging separate from core logic (e.g., via a decorator or wrapper).
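For example, logging can be reattached at the boundary with a small wrapper, leaving the core pure (a sketch, not the repository's code):

```python
import functools
import logging

logger = logging.getLogger(__name__)

def logged(fn):
    """Wrap a pure function with logging at the call boundary."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        logger.debug("%s -> %r", fn.__name__, result)
        return result
    return wrapper

extract_keywords_logged = logged(extract_keywords)  # core stays untouched
```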
Patch 3 — Enforce Invariants
Before (BAD):
```python
def compute_score(keywords):
    # No sanity checks
    return sum(len(k) for k in keywords) * random.random()
```
After (GOOD):
```python
def compute_score(keywords):
    assert all(isinstance(k, str) for k in keywords), "Keywords must be strings"
    base = sum(len(k) for k in keywords)
    return base  # deterministic, no random factor
```
What this fixes
- Detects corrupted inputs early, before they propagate.
- Removes the random factor, so the score is a deterministic, bounded function of the keywords.
Patch 4 — Reset Per‑Run State
Before (BAD):
```python
RUN_COUNT += 1      # global counter never reset
CURRENT_SCORE += 5  # accumulates across runs
```
After (GOOD):
```python
def run_pipeline(text):
    # Local state only
    keywords = extract_keywords(text)
    score = compute_score(keywords)
    action = decide_action(score)
    return {"score": score, "action": action}
```
What this fixes
- Guarantees each invocation is independent.
- Eliminates drift and hidden state across runs.
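With all state local, reproducibility becomes a property you can assert directly (illustrative input text):

```python
first = run_pipeline("supplier shipment delayed again")
second = run_pipeline("supplier shipment delayed again")
assert first == second  # same input, same decision, every time
```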
Additional Patches
Patch 5 — Make Dependencies Explicit
Before (BAD):
```python
if GS.LAST_TEXT is not None:       # hidden dependency on the previous run
    base += len(GS.LAST_TEXT) % 13
```
After (GOOD):
```python
def score_keywords(keywords, text):
    base = sum(len(w) % 7 for w in keywords)
    return base + (len(text) % 13)
```
What this fixes
- No hidden inputs.
- Clear data flow.
- Safe refactoring.
Patch 6 — Name the Magic Numbers
Before (BAD):
```python
if score > 42:
    action = "PANIC_REORDER"
```
After (GOOD):
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    panic_threshold: int = 42

cfg = Config()
if score > cfg.panic_threshold:
    action = "PANIC_REORDER"
```
What this fixes
- Decisions become explainable.
- Parameters become reviewable.
- Behavior changes become intentional.
Summary
Together, these patches:
- Remove hidden state
- Eliminate non‑determinism
- Make behavior explainable
- Restore trust in the system
No agents. No frameworks. Just engineering discipline.
Final Takeaway
The BAD system works. That’s the problem.
It fails in the most dangerous way possible: plausibly and quietly.
The GOOD system is boring, predictable, and easy to reason about — which is exactly what you want in production.
Working code is not the same as a working system.
Code & Reproducibility
All code used in this article — including the intentionally broken system, the clean implementation, and the benchmark — is available on GitHub:
👉 https://github.com/Ertugrulmutlu/I-Intentionally-Built-a-Bad-Decision-System-So-You-Don-t-Have-To
If you want to reproduce the results, run:
```bash
python compare.py
```
The benchmark will run the same input multiple times through both systems and show, in a few lines of output, why predictability matters more than flashy abstractions.