Code Smell 315 - Cloudflare Feature Explosion

Published: December 2, 2025 at 06:00 AM EST
4 min read
Source: Dev.to

TL;DR

Overly large auto‑generated configuration can crash your system.

Problems 😔

  • Config overload
  • Hardcoded limit
  • Lack of validations
  • Crash on overflow
  • Fragile coupling
  • Cascading failures
  • Hidden assumptions
  • Silent duplication
  • Unexpected crashes
  • Thread panics in critical paths
  • Treating internal data as trusted input
  • Poor observability
  • Single point of failure in internet infrastructure

Solutions 😃

  • Validate inputs early
  • Enforce soft limits
  • Fail‑fast on parse
  • Monitor config diffs
  • Version config safely
  • Use backpressure mechanisms
  • Degrade functionality gracefully
  • Log and continue
  • Improve degradation metrics
  • Implement proper Result/Option handling with fallbacks
  • Treat all configuration as untrusted input
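Several of these solutions combine naturally. As a minimal sketch (the helper name is illustrative, not Cloudflare's API), a loader can enforce a soft limit by truncating and warning instead of crashing:

```rust
// Illustrative sketch: enforce a soft limit and degrade gracefully.
// `clamp_features` is a hypothetical helper, not a real API.
fn clamp_features(mut features: Vec<String>, soft_max: usize) -> Vec<String> {
    if features.len() > soft_max {
        eprintln!(
            "warning: {} features exceed soft limit {}; truncating",
            features.len(),
            soft_max
        );
        features.truncate(soft_max); // log and continue instead of panicking
    }
    features
}

fn main() {
    let too_many = vec!["f".to_string(); 300];
    let clamped = clamp_features(too_many, 200);
    assert_eq!(clamped.len(), 200); // degraded, but still serving
}
```

The service keeps answering requests with a reduced feature set rather than taking down the data plane.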

Context 💬

In the early hours of November 18, 2025, Cloudflare’s global network began failing to deliver core HTTP traffic, returning a flood of 5xx errors to end users.
The outage was not caused by an external attack or security problem; it stemmed from an internal “latent defect” triggered by a routine configuration change.

The failure fluctuated over time until a fix was fully deployed. The root cause lay in a software bug in Cloudflare’s Bot Management module and its downstream proxy logic.

Technical Chain of Events

  • Database Change (11:05 UTC) – A ClickHouse permissions update made previously implicit table access explicit, allowing users to see metadata from both the default and r0 databases.

  • SQL Query Assumption – A Bot Management query lacked a database name filter:

    SELECT name, type FROM system.columns
    WHERE table = 'http_requests_features'
    ORDER BY name;

    This query began returning duplicate rows—once for default, once for r0.

  • Feature File Explosion – The machine‑learning feature file doubled from ~60 features to over 200, with duplicate entries.

  • Hard Limit Exceeded – The Bot Management module had a hard‑coded limit of 200 features (for memory pre‑allocation), which was now exceeded.

  • The Fatal .unwrap() – Rust code called .unwrap() on a Result that was now returning an error, causing the thread to panic with “called Result::unwrap() on an Err value”.

  • Global Cascade – The panic propagated across all 330+ data centers, bringing down core CDN services, Workers KV, Cloudflare Access, Turnstile, and the dashboard.
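The duplicate rows stem from the missing database filter in the query above. A hedged sketch of a corrected query (the exact production fix may differ) simply scopes the lookup to one database:

```sql
SELECT name, type FROM system.columns
WHERE table = 'http_requests_features'
  AND database = 'default'  -- scope to one database to avoid duplicate rows
ORDER BY name;
```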

The estimated financial impact across affected businesses ranges from $180 million to $360 million.
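The failure mode can be reproduced in miniature. In this hedged sketch (hypothetical names, not Cloudflare's code), a loader returns `Err` past a hard limit, and `.unwrap()` converts that `Err` into a thread panic:

```rust
use std::panic;

// Hypothetical loader with a hard-coded capacity, mirroring the outage.
fn load_features(names: &[&str], max: usize) -> Result<Vec<String>, String> {
    if names.len() > max {
        return Err(format!("too many features: {} > {}", names.len(), max));
    }
    Ok(names.iter().map(|s| s.to_string()).collect())
}

fn main() {
    // Duplicated metadata doubles the list past the 200-feature limit.
    let duplicated: Vec<&str> = ["f1", "f2", "f3"].repeat(100); // 300 entries
    // `.unwrap()` panics with "called `Result::unwrap()` on an `Err` value".
    let outcome = panic::catch_unwind(|| load_features(&duplicated, 200).unwrap());
    assert!(outcome.is_err()); // the thread panicked, as in the incident
}
```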

Sample Code 📖

Wrong ❌

    // Hard-coded limit enforced with a panic
    let features: Vec<Feature> = load_features_from_db();
    let max = 200;
    assert!(features.len() <= max); // panics the thread when exceeded

Right 👉

    fn load_features(max: usize) -> Result<Vec<Feature>, String> {
        let raw: Vec<Result<Feature, String>> = load_features_from_db();

        if raw.len() > max {
            return Err(format!(
                "too many features: {} > {}",
                raw.len(),
                max
            ));
        }

        Ok(raw.into_iter()
            .filter_map(|r| r.ok())
            .collect())
    }

Detection 🔍

Search your codebase for the following patterns:

  • .unwrap() – direct calls to this method
  • .expect() – similarly dangerous
  • panic!() – explicit panics in non‑test code
  • std::panic::panic_any() – panic without context

When you find these patterns, ask: “What happens to my system when this Result contains an Err?” If the answer is “the thread crashes and the request fails,” you’ve identified the smell.

Automated linters can help. Most Rust style guides recommend Clippy, which can flag unwrap() usage in production code paths. Configure Clippy with #![deny(clippy::unwrap_used)] to prevent new unwrap() calls from entering the codebase.
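A minimal sketch of what that lint configuration buys you (clippy::unwrap_used and clippy::expect_used are real Clippy lints; plain rustc compiles this unchanged because it ignores tool lints):

```rust
// Deny unwrap/expect crate-wide; Clippy rejects violations at CI time.
#![deny(clippy::unwrap_used, clippy::expect_used)]

fn main() {
    let parsed: Result<u32, String> = Ok(7);
    // let n = parsed.unwrap(); // Clippy would reject this line
    let n = parsed.unwrap_or_default(); // explicit fallback instead
    assert_eq!(n, 7);
}
```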

Tags 🏷️

  • Fail‑Fast

Level 🔋

Advanced

Why the Bijection Is Important 🗺️

Your internal config generator must map exactly what your code expects. A mismatched config (e.g., duplicated metadata) breaks the bijection between what the config represents and what the proxy code handles.

Assuming “this file will always have ≤ 200 entries” while the reality sends 400 entries causes the model to explode, leading to cascading failures. Ensuring a clean mapping between the config source and code input helps prevent crashes and unpredictable behavior.
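One way to defend that bijection is to reject silent duplication at the boundary. A hedged sketch, with a hypothetical validator and made-up feature names:

```rust
use std::collections::HashSet;

// Hypothetical validator: reject duplicated feature names early,
// before they can silently double the config file.
fn validate_unique(names: &[&str]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for name in names {
        if !seen.insert(*name) {
            return Err(format!("duplicate feature: {name}"));
        }
    }
    Ok(())
}

fn main() {
    assert!(validate_unique(&["feature_a", "feature_b"]).is_ok());
    // Duplicated metadata (as returned by the unscoped query) is caught:
    assert!(validate_unique(&["feature_a", "feature_a"]).is_err());
}
```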

AI Generation 🤖

AI generators often prioritize “correct” logic over “resilient” logic. If you ask an AI to “ensure the list is never larger than 200 items,” it might generate an assertion or a panic—the most direct way to satisfy the requirement—introducing this smell.

The irony: memory‑safe languages like Rust prevent undefined behavior and memory corruption, but they can’t prevent logic errors, poor error handling, or architectural assumptions. Memory safety ≠ System safety.

AI Detection 🧲

AI can be instructed to look for availability risks. Combine linters with AI to flag panic calls in production code. Human review of critical functions remains essential.

Try Them! 🛠

Remember: AI assistants make lots of mistakes.

Suggested Prompt:
“Remove all .unwrap() and .expect() calls. Return Result instead and validate the vector bounds explicitly.”
