Harness bugs, not model bugs

Published: April 24, 2026 at 10:51 AM EDT
2 min read
Source: Dev.to

What actually broke

  • Default reasoning effort got dialled down.
    A UX fix dropped Claude Code’s default from “high” to “medium” in early March. Users noticed it felt dumber. The change was reverted on April 7.

  • A caching optimisation dropped prior reasoning every turn.
    The optimisation was supposed to clear stale thinking once per idle session, but a bug made it fire on every turn. Claude kept executing tasks without the reasoning that motivated them, which surfaced as forgetfulness and repetition. Fixed on April 10.

  • A verbosity system prompt hurt coding quality.
    The prompt “Keep responses under 100 words.” caused a 3% regression in code quality that internal evals missed. It was caught by broader ablations and reverted on April 20.

None of these incidents involved a model change; the weights didn’t move and the API was never in scope.
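Anthropic didn’t publish the offending code, but the caching failure mode is easy to reproduce in miniature. A minimal sketch, with a hypothetical `Session` class standing in for the harness’s context-cache layer (none of these names come from Claude Code itself):

```python
class Session:
    """Toy harness session carrying prior reasoning between turns."""

    def __init__(self):
        self.reasoning = []   # prior thinking blocks carried across turns
        self.was_idle = True  # set when the session resumes after idling

    def on_turn_buggy(self, message):
        # BUG: the idle guard was dropped, so prior reasoning is
        # discarded on every single turn, not once per idle session.
        self.reasoning.clear()
        self.reasoning.append(f"thinking about: {message}")

    def on_turn_fixed(self, message):
        # FIX: clear stale thinking only once, when resuming from idle.
        if self.was_idle:
            self.reasoning.clear()
            self.was_idle = False
        self.reasoning.append(f"thinking about: {message}")


buggy, fixed = Session(), Session()
for msg in ["plan the refactor", "now apply step 2"]:
    buggy.on_turn_buggy(msg)
    fixed.on_turn_fixed(msg)

print(len(buggy.reasoning))  # 1 — the plan from turn one is gone
print(len(fixed.reasoning))  # 2 — prior reasoning survives the turn
```

The diff between the two methods is one `if`: exactly the kind of silent failure that passes a single-turn eval and only shows up as “Claude forgot why it was doing this” in long sessions.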

The lesson

  • The model is not the product.
    What users experience is model + harness + system prompt + tool wiring + context management + caching. Each layer can have its own bugs. When someone says “Claude got worse,” the weights are usually the last thing that changed.

  • API‑layer products were unaffected.
    If you’re building directly against the Messages API, none of these bugs touched you. This is why “am I on Claude Code, or am I on the raw API?” matters.

  • “Eval passed” ≠ “no regression.”
    The verbosity prompt passed Anthropic’s initial evals, but a broader ablation—removing lines one at a time—caught the 3% drop. Fixed eval suites can miss behavioural drift; ablations catch it.
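The line-by-line ablation is worth stealing for your own harness. A minimal sketch, assuming a hypothetical `score_fn` that runs your eval suite on a given system prompt and returns a quality score (not Anthropic’s actual tooling):

```python
def ablate_prompt(prompt_lines, score_fn, threshold=0.01):
    """Re-score the eval with each prompt line removed in turn.

    Returns the lines whose removal *improves* the score by more than
    `threshold` — i.e. lines that are quietly hurting quality.
    """
    baseline = score_fn("\n".join(prompt_lines))
    harmful = []
    for i, line in enumerate(prompt_lines):
        ablated = prompt_lines[:i] + prompt_lines[i + 1:]
        delta = score_fn("\n".join(ablated)) - baseline
        if delta > threshold:
            harmful.append((line, delta))
    return harmful


# Toy score_fn that penalises the verbosity instruction, mimicking
# the 3% regression described in the postmortem.
def score_fn(prompt):
    return 1.0 - (0.03 if "Keep responses under 100 words." in prompt else 0.0)


lines = ["You are a coding assistant.", "Keep responses under 100 words."]
for line, delta in ablate_prompt(lines, score_fn):
    print(f"removing {line!r} gains {delta:+.3f}")
```

The point is the loop shape, not the scoring: a fixed eval only tells you the whole prompt passes; the ablation attributes regressions to individual lines.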

What to actually do

  • On Claude Code?
    Update to v2.1.116+; all three fixes are included. Usage limits were reset as an apology.

  • On the API directly?
    Nothing to do. Stay on whatever model you were using.

  • Shipping your own harness on top of a frontier model?
    Read the postmortem twice, then audit your prompt + caching + context‑management pipeline for the same silent‑failure modes. The bugs Anthropic described are exactly the ones every harness reinvents.

The meta‑lesson is boring and important: most of the quality variance lives between “the model” and “the thing your user sees.” Ship good harnesses.

✏️ Drafted with KewBot (AI), edited and approved by Drew.
