Harness bugs, not model bugs

Published: April 24, 2026 at 10:51 AM EDT
2 min read
Source: Dev.to

What actually broke

  • Default reasoning effort got dialled down.
    A UX fix dropped Claude Code’s default from “high” to “medium” in early March. Users noticed it felt dumber. The change was reverted on April 7.

  • A caching optimisation dropped prior reasoning every turn.
    The optimisation was supposed to clear stale thinking once per idle session, but a bug made it fire on every turn. Claude kept executing tasks without the reasoning that motivated them, which surfaced as forgetfulness and repetition. Fixed on April 10.

  • A verbosity system prompt hurt coding quality.
    The prompt “Keep responses under 100 words.” caused a 3% regression in code quality that internal evals missed. It was caught by broader ablations and reverted on April 20.

None of these incidents involved a model change; the weights didn’t move and the API was never in scope.
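Anthropic didn’t publish the offending code, but the caching failure mode is easy to reproduce in miniature. A minimal sketch, with a hypothetical `Session` class standing in for the harness’s context-cache layer (none of these names come from Claude Code itself):

```python
class Session:
    """Toy harness session carrying prior reasoning between turns."""

    def __init__(self):
        self.reasoning = []   # prior thinking blocks carried across turns
        self.was_idle = True  # set when the session resumes after idling

    def on_turn_buggy(self, message):
        # BUG: the idle guard was dropped, so prior reasoning is
        # discarded on every single turn, not once per idle session.
        self.reasoning.clear()
        self.reasoning.append(f"thinking about: {message}")

    def on_turn_fixed(self, message):
        # FIX: clear stale thinking only once, when resuming from idle.
        if self.was_idle:
            self.reasoning.clear()
            self.was_idle = False
        self.reasoning.append(f"thinking about: {message}")


buggy, fixed = Session(), Session()
for msg in ["plan the refactor", "now apply step 2"]:
    buggy.on_turn_buggy(msg)
    fixed.on_turn_fixed(msg)

print(len(buggy.reasoning))  # 1 — the plan from turn one is gone
print(len(fixed.reasoning))  # 2 — prior reasoning survives the turn
```

The diff between the two methods is one `if`: exactly the kind of silent failure that passes a single-turn eval and only shows up as “Claude forgot why it was doing this” in long sessions.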

The lesson

  • The model is not the product.
    What users experience is model + harness + system prompt + tool wiring + context management + caching. Each layer can have its own bugs. When someone says “Claude got worse,” the weights are usually the last thing that changed.

  • API‑layer products were unaffected.
    If you’re building directly against the Messages API, none of these bugs touched you. This is why “am I on Claude Code, or am I on the raw API?” matters.

  • “Eval passed” ≠ “no regression.”
    The verbosity prompt passed Anthropic’s initial evals, but a broader ablation—removing lines one at a time—caught the 3% drop. Fixed eval suites can miss behavioural drift; ablations catch it.
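The line-by-line ablation is worth stealing for your own harness. A minimal sketch, assuming a hypothetical `score_fn` that runs your eval suite on a given system prompt and returns a quality score (not Anthropic’s actual tooling):

```python
def ablate_prompt(prompt_lines, score_fn, threshold=0.01):
    """Re-score the eval with each prompt line removed in turn.

    Returns the lines whose removal *improves* the score by more than
    `threshold` — i.e. lines that are quietly hurting quality.
    """
    baseline = score_fn("\n".join(prompt_lines))
    harmful = []
    for i, line in enumerate(prompt_lines):
        ablated = prompt_lines[:i] + prompt_lines[i + 1:]
        delta = score_fn("\n".join(ablated)) - baseline
        if delta > threshold:
            harmful.append((line, delta))
    return harmful


# Toy score_fn that penalises the verbosity instruction, mimicking
# the 3% regression described in the postmortem.
def score_fn(prompt):
    return 1.0 - (0.03 if "Keep responses under 100 words." in prompt else 0.0)


lines = ["You are a coding assistant.", "Keep responses under 100 words."]
for line, delta in ablate_prompt(lines, score_fn):
    print(f"removing {line!r} gains {delta:+.3f}")
```

The point is the loop shape, not the scoring: a fixed eval only tells you the whole prompt passes; the ablation attributes regressions to individual lines.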

What to actually do

  • On Claude Code?
    Update to v2.1.116+; all three fixes are included. Usage limits were reset as an apology.

  • On the API directly?
    Nothing to do. Stay on whatever model you were using.

  • Shipping your own harness on top of a frontier model?
    Read the postmortem twice, then audit your prompt + caching + context‑management pipeline for the same silent‑failure modes. The bugs Anthropic described are exactly the ones every harness reinvents.

The meta‑lesson is boring and important: most of the quality variance lives between “the model” and “the thing your user sees.” Ship good harnesses.

✏️ Drafted with KewBot (AI), edited and approved by Drew.
