When long chats quietly break builds
Source: Dev.to
Context fades faster than you think
I ran a week‑long debugging thread where we iterated on a migration, API surface, and test harness. At the start I told the model: Node 18, Postgres 13, TypeScript, no native modules. After a few dozen turns the assistant began suggesting Python snippets and client libraries that only exist in newer Postgres. Each new reply still looked coherent, but it stopped obeying the constraints I had given it.
The result was a patch that passed the quick manual check but failed CI because the test database still ran Postgres 13, exactly as I had specified on turn one. The failure mode was not flashy; it was slow context drift—a creeping mismatch between the conversation and the system I actually run.
Hallucinations sneak in through missing tools
We connect models to a schema extractor and a CI status endpoint. Those tools sometimes return partial or empty payloads, and the model then fills the gaps with plausible-looking guesses.
- Example: a truncated schema came back from the extractor, and the model guessed column types.
- The generated migration ran locally and created columns with the wrong precision.
- Tests didn’t catch it because they mocked the schema layer.
Building on those guesses, later changes silently relied on the wrong columns. The chain reaction worries me more than a single hallucinated line: one small incorrect assumption in a long thread becomes a foundation for subsequent changes.
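A cheap way to break that chain reaction is to validate the extractor's payload before the model ever sees it. Here is a minimal sketch; the payload shape (`SchemaPayload`, `ColumnInfo`) is illustrative, not the real tool's API:

```typescript
// Hypothetical shape of the schema extractor's payload; these names are
// illustrative, not the actual tool's types.
interface ColumnInfo {
  name: string;
  type: string | null; // null when the extractor truncated the payload
}

interface SchemaPayload {
  table: string;
  columns: ColumnInfo[];
}

// Reject partial payloads instead of letting the model fill the gaps:
// the table must have columns, and every column an explicit type.
function validateSchemaPayload(payload: SchemaPayload): string[] {
  const errors: string[] = [];
  if (payload.columns.length === 0) {
    errors.push(`no columns returned for ${payload.table}`);
  }
  for (const col of payload.columns) {
    if (!col.type) {
      errors.push(`missing type for ${payload.table}.${col.name}`);
    }
  }
  return errors; // non-empty => stop and surface to a human
}
```

If the returned list is non-empty, the step fails loudly before any migration is generated—so a truncated schema becomes a visible tool error, not a silent guess about column precision.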
Hidden assumptions multiply over long sessions
Every time we keep a chat open, we accumulate implicit defaults. The model prefers recent tokens, so new examples and off‑hand remarks carry more weight than the original constraints.
- We saw the assistant start assuming default timeouts, memory limits, and even a different authentication flow after someone pasted a snippet from another repo into the same thread.
- Once the model assumes the wrong runtime or dependency version, it proposes code that is syntactically valid but operationally wrong.
To mitigate this, I started doing regular resets and adding explicit manifest blocks at the top of the prompt. When I need to compare approaches, I use a separate multi‑model workspace so drift in one conversation can't contaminate the other; having a dedicated place for that comparison made it easy to test alternative fixes side by side in a controlled way via a shared chat tool.
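The manifest block itself can be generated from a small typed object so it never drifts out of date between resets. A sketch under assumed field names (`Manifest` and its keys are our convention, not a standard):

```typescript
// Illustrative manifest shape; the field names are our own convention.
interface Manifest {
  runtime: string;
  database: string;
  language: string;
  forbidden: string[];
}

const manifest: Manifest = {
  runtime: "node-18",
  database: "postgres-13",
  language: "typescript",
  forbidden: ["native modules", "client libraries requiring Postgres > 13"],
};

// Render the manifest as the block pasted at the top of every fresh chat.
function renderManifest(m: Manifest): string {
  return [
    "## Environment manifest (do not violate)",
    `- runtime: ${m.runtime}`,
    `- database: ${m.database}`,
    `- language: ${m.language}`,
    `- forbidden: ${m.forbidden.join("; ")}`,
  ].join("\n");
}
```

Because the block is regenerated from one source of truth, every reset starts from the same constraints instead of whatever the previous thread had accumulated.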
Practical verification and logging that cut the feedback loop
After several wasted rollbacks we built cheap guardrails:
- Logging – Everything the model returns is logged as raw text together with the prompt and the tool outputs. This lets us diff replies over time and detect when suggestions deviate from earlier constraints.
- Pipeline checks – Generated code runs through a pipeline before review:
  - static type checks,
  - a sandbox run with a stripped dataset,
  - a set of targeted integration tests that assert environmental assumptions.
- Tool validation – If the schema extractor returns fewer than N columns or a status endpoint is slow, we stop and surface the tool error to a human instead of letting the model guess.
- Citation verification – For sourcing and verification I often ask the model to cite the exact doc section or changelog entry, then cross‑check using a formal research workflow rather than trusting the claim in‑chat.
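The logging guardrail is little more than an append-only JSONL file. A minimal sketch, assuming our own (non-standard) record shape:

```typescript
import { appendFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// One raw record per model turn; field names are our own convention.
// Logging the reply as raw text (no post-processing) keeps later diffs honest.
interface TurnRecord {
  ts: string;                          // ISO timestamp
  prompt: string;                      // exact prompt sent
  toolOutputs: Record<string, string>; // raw payloads, keyed by tool name
  reply: string;                       // exact model reply
}

const logPath = join(tmpdir(), "model-turns.jsonl");

// Append one JSON line and return it, so callers can diff replies over time.
function logTurn(path: string, record: TurnRecord): string {
  const line = JSON.stringify(record);
  appendFileSync(path, line + "\n");
  return line;
}
```

JSONL keeps each turn grep-able and diff-able without a database, which is exactly what you want when hunting for the turn where a constraint was first dropped.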
Operational rules I actually follow now
- Segment work – Keep long work in short segments. Reset context at natural checkpoints: before major refactors, before merges, and whenever someone new joins the thread.
- Manifest requirement – Require a minimal manifest at the top of a conversation: runtime, DB, test fixtures, and a short list of forbidden changes.
- Tool call logging – Log every tool call and validate responses with simple assertions.
- Changelog & assumptions – Force the model to output a compact changelog and a list of assumptions it made, in machine‑readable form, so CI can reject merges that introduce new unstated assumptions.
- Structured research pass – When richer verification or triangulation is needed, push the claim through a structured research pass that separates model reasoning from evidence gathering and treats external sources as ground truth rather than optional citations.
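The changelog-and-assumptions rule can be enforced with a few lines in CI. A sketch, assuming a hypothetical machine-readable trailer format and an allowlist derived from the manifest:

```typescript
// Hypothetical machine-readable trailer we require at the end of each reply;
// CI parses it and rejects merges that introduce unapproved assumptions.
interface ReplyTrailer {
  changelog: string[];   // compact, one entry per change
  assumptions: string[]; // everything the model took for granted
}

// Allowlist drawn from the manifest; anything outside it blocks the merge.
const approvedAssumptions = new Set(["node-18", "postgres-13", "typescript"]);

function gateMerge(trailer: ReplyTrailer): { ok: boolean; violations: string[] } {
  const violations = trailer.assumptions.filter((a) => !approvedAssumptions.has(a));
  return { ok: violations.length === 0, violations };
}
```

The point is not the allowlist itself but the shape of the contract: the model must state its assumptions in a form a machine can check, so an unstated assumption becomes a failed merge instead of a foundation for the next fifty turns.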