[Paper] One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness
Source: arXiv - 2604.13006v1
Overview
Instruction‑tuned large language models (LLMs) are praised for delivering helpful, well‑structured answers. This paper uncovers a surprising weakness: forbidding a single common token—such as a punctuation mark or a frequent word—can cause these models to “collapse,” producing markedly shorter and less comprehensive replies. The authors demonstrate the problem across several open‑source model families and even a commercial model (GPT‑4o‑mini), and they trace the root cause to the way instruction‑tuned models plan their output.
Key Contributions
- Empirical discovery of token‑level fragility – a single lexical constraint reduces response comprehensiveness by 14–48 % in pairwise comparisons scored by LLM judges.
- Cross‑model validation – the collapse appears in three open‑weight families (e.g., Llama‑2‑Chat, Mistral‑Instruct) and in the closed‑weight GPT‑4o‑mini, contradicting earlier claims that only format‑level constraints matter.
- Mechanistic insight – attributes the collapse to a planning failure: under the constraint, the model appears to commit to a shorter answer up front rather than losing content partway through decoding.
- Predictive probing – linear probes on the prompt representation can forecast the eventual response length (R² = 0.51–0.93) before any token is emitted, suggesting that the collapse is decided at the prompt stage rather than during decoding.
- Two‑pass recovery – a simple “generate‑then‑rewrite” pipeline restores 59–96 % of the lost length, suggesting a practical mitigation.
- Evaluation gap exposure – standard single‑score LLM‑as‑judge evaluation registers only a 3.5 % quality dip, while pairwise comparisons reveal a 23 % drop, highlighting a blind spot in current automated evaluation pipelines.
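The gap between single-score and pairwise evaluation can be illustrated with a small sketch. The scores below are made-up toy numbers, not data from the paper, and the pairwise preference is simulated by comparing the two scores directly, which is a simplifying assumption (the paper uses separate head-to-head judgments). The point is only that a small but consistent gap barely moves a mean score while dominating head-to-head comparisons.

```python
from statistics import mean

# Toy single-score judgments on a 1-10 scale (illustrative values only).
baseline_scores = [8.0, 7.5, 9.0, 8.5, 7.0, 8.0]
constrained_scores = [7.7, 7.2, 8.7, 8.2, 6.8, 7.7]


def relative_score_drop(base, constrained):
    """Relative drop in the mean absolute score."""
    return (mean(base) - mean(constrained)) / mean(base)


def baseline_win_rate(base, constrained):
    """Share of prompts where the baseline is preferred; ties count as half."""
    wins = sum(
        1.0 if b > c else 0.5 if b == c else 0.0
        for b, c in zip(base, constrained)
    )
    return wins / len(base)


drop = relative_score_drop(baseline_scores, constrained_scores)
win = baseline_win_rate(baseline_scores, constrained_scores)
print(f"mean score drop: {drop:.1%}, baseline win rate: {win:.0%}")
```

In this toy setup every constrained answer is slightly worse, so the baseline wins every comparison even though the mean score moves by only a few percent.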
Methodology
- Constraint Design – The authors construct minimal lexical constraints by banning a single punctuation character (e.g., “:”) or a high‑frequency word (e.g., “the”).
- Model Suite – They test four instruction‑tuned families (Llama‑2‑Chat, Mistral‑Instruct, Mixtral‑Instruct, and GPT‑4o‑mini) alongside the base (non‑instruction‑tuned) counterparts of the open‑weight models.
- Prompt Set – 240 diverse instruction prompts covering coding, reasoning, and knowledge tasks are drawn from the MT‑Bench benchmark.
- Generation & Evaluation
  - Unconstrained baseline: standard instruction‑tuned generation.
  - Constrained generation: the same prompt with the token ban enforced via the model’s built‑in token‑level constraint API.
  - Pairwise comparison: 1,920 head‑to‑head judgments made by GPT‑4o‑mini and GPT‑4o, asking which answer is more helpful/comprehensive.
  - LLM‑as‑judge scoring: a conventional single‑score evaluation for comparison.
- Mechanistic Probing – Linear regression probes are trained on the model’s hidden representation of the prompt to predict the final response length, testing whether the model “knows” it will collapse before generation starts.
- Two‑Pass Recovery – A fallback pipeline first generates without constraints, then rewrites the output while respecting the banned token, measuring how much length can be reclaimed.
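The summary does not say which constraint API is used for the token ban. One common way to enforce such a ban is to mask the banned token's logit at every decoding step, sketched below as an assumption rather than as the paper's confirmed mechanism. The vocabulary, `toy_next_logits`, and the fallback preference are all hypothetical stand-ins for a real model.

```python
import math


def ban_tokens(logits, banned_ids):
    """Return a copy of `logits` with every banned token id set to -inf."""
    masked = list(logits)
    for token_id in banned_ids:
        masked[token_id] = -math.inf
    return masked


def greedy_decode(next_logits, banned_ids, eos_id, max_steps=20):
    """Greedy decoding that can never emit a banned token.

    `next_logits(prefix)` is a hypothetical callable returning one logit
    per vocabulary entry for the next position.
    """
    output = []
    for _ in range(max_steps):
        logits = ban_tokens(next_logits(output), banned_ids)
        token_id = max(range(len(logits)), key=logits.__getitem__)
        if token_id == eos_id:
            break
        output.append(token_id)
    return output


# Toy vocabulary and toy "model" (both invented for illustration).
VOCAB = ["<eos>", "Steps", ":", "first", "then"]


def toy_next_logits(prefix):
    script = [1, 2, 3, 4, 0]  # Steps : first then <eos>
    logits = [0.0] * len(VOCAB)
    logits[4] = 1.0  # weak fallback preference for "then"
    logits[script[min(len(prefix), len(script) - 1)]] = 5.0
    return logits


free = greedy_decode(toy_next_logits, banned_ids=set(), eos_id=0)
constrained = greedy_decode(toy_next_logits, banned_ids={2}, eos_id=0)
print([VOCAB[i] for i in free])
print([VOCAB[i] for i in constrained])
```

Masking only guarantees that the banned token never appears. It says nothing about whether the rest of the answer stays complete, which is the behaviour the paper measures.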
Results & Findings
| Model (Instruction‑tuned) | Avg. Comprehensiveness Loss | Baseline Win Rate (pairwise) | Recovery (Two‑Pass) |
|---|---|---|---|
| Llama‑2‑Chat | 14 % | 77 % | 59 % |
| Mistral‑Instruct | 22 % | 85 % | 71 % |
| Mixtral‑Instruct | 31 % | 92 % | 96 % |
| GPT‑4o‑mini (closed) | 31 % | 99 % | 84 % |
- Base models (no instruction tuning) show negligible, noisy effects, confirming that the fragility is introduced during instruction tuning.
- Linear probes achieve high R² on instruction‑tuned models (up to 0.93), but negative R² on base models, indicating that the “collapse decision” is encoded only after tuning.
- MT‑Bench replication shows the phenomenon across all eight task categories (coding, reasoning, summarization, etc.).
- Evaluation discrepancy: single‑score LLM‑as‑judge evaluation reports only a 3.5 % drop, while pairwise comparisons reveal a 23 % drop, exposing a systematic under‑estimation of constrained‑generation failures.
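The probe result above relies on held-out R², which can be negative when a probe predicts worse than simply guessing the mean. The sketch below shows this with a least-squares probe on synthetic data. The random features stand in for prompt hidden states and the targets for response lengths; the dimensions and noise levels are arbitrary assumptions, and none of it comes from the paper.

```python
import numpy as np


def add_bias(X):
    """Append a constant column so the probe can learn an intercept."""
    return np.hstack([X, np.ones((X.shape[0], 1))])


def fit_linear_probe(X, y):
    """Ordinary least-squares probe; returns the weight vector."""
    w, *_ = np.linalg.lstsq(add_bias(X), y, rcond=None)
    return w


def r_squared(X, y, w):
    """R^2 = 1 - SS_res / SS_tot on the given (held-out) data."""
    residual = y - add_bias(X) @ w
    ss_res = float(residual @ residual)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)
n_train, n_test, dim = 40, 200, 16
X = rng.normal(size=(n_train + n_test, dim))  # synthetic "hidden states"
true_w = rng.normal(size=dim)
targets = {
    # Length is a linear function of the features plus a little noise.
    "length encoded": X @ true_w + 0.3 * rng.normal(size=n_train + n_test),
    # Length is unrelated to the features.
    "length not encoded": rng.normal(size=n_train + n_test),
}

scores = {}
for name, y in targets.items():
    w = fit_linear_probe(X[:n_train], y[:n_train])
    scores[name] = r_squared(X[n_train:], y[n_train:], w)
    print(f"{name}: held-out R^2 = {scores[name]:.2f}")
```

When the target is unrelated to the features, the probe fits noise in the training split and typically scores near or below zero on held-out data, which is how a negative R² on base models can be read.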
Practical Implications
- Robustness testing – Developers deploying instruction‑tuned LLMs (e.g., in chat assistants, code generators, or help‑desk bots) should include token‑level stress tests, not just format or length constraints.
- Safety & compliance – When models must avoid certain words for policy or legal reasons, collapse can produce incomplete or misleading answers, undermining compliance guarantees.
- Mitigation strategies – Implementing a two‑pass generate‑then‑rewrite workflow can recover most lost content with minimal engineering overhead.
- Model selection – For applications where strict lexical constraints are unavoidable, base (non‑instruction‑tuned) models or variants fine‑tuned explicitly on constrained data may be safer choices.
- Evaluation pipelines – Relying solely on LLM‑as‑judge scores may mask serious degradations; incorporating pairwise or human‑in‑the‑loop assessments is advisable for high‑stakes deployments.
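The two-pass mitigation mentioned above can be sketched as a thin wrapper around any text-generation call. Everything here is an assumption for illustration: `generate` is a hypothetical prompt-to-completion callable, the rewrite prompt wording is invented, the ban check is a plain substring test (a word-level ban would need a tokenization-aware check), and `length_recovery` is one plausible reading of "share of lost length restored", not the paper's stated formula.

```python
from typing import Callable


def violates_ban(text: str, banned: str) -> bool:
    """True if the banned string appears anywhere in the text."""
    return banned in text


def two_pass_generate(
    prompt: str,
    banned: str,
    generate: Callable[[str], str],
    max_rewrites: int = 2,
) -> str:
    """Pass 1: answer freely. Pass 2: rewrite the draft without `banned`."""
    draft = generate(prompt)
    if not violates_ban(draft, banned):
        return draft
    for _ in range(max_rewrites):
        rewrite_prompt = (
            f"Rewrite the text below so that it never contains {banned!r}. "
            f"Keep every point and all detail.\n\n{draft}"
        )
        candidate = generate(rewrite_prompt)
        if not violates_ban(candidate, banned):
            return candidate
        draft = candidate
    # Last resort: remove the banned string mechanically.
    return draft.replace(banned, "")


def length_recovery(free_len: int, constrained_len: int, two_pass_len: int) -> float:
    """Assumed metric: fraction of the lost length that two-pass restores."""
    return (two_pass_len - constrained_len) / (free_len - constrained_len)


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call, for demonstration only."""
    if prompt.startswith("Rewrite"):
        return "Steps - first install, then configure, then test."
    return "Steps: first install, then configure, then test."


answer = two_pass_generate("How do I set it up?", ":", stub_llm)
print(answer)
```

Keeping the free draft as the input to the rewrite step is the design choice that matters: the content is planned without the constraint, and the constraint is applied only as an editing task.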
Limitations & Future Work
- Constraint scope – The study focuses on single‑token bans; multi‑token or semantic constraints (e.g., “no profanity”) may behave differently.
- Model diversity – While four families were examined, other instruction‑tuned models (e.g., Claude, Gemini) were not included; generalization to them remains an open question.
- Probe simplicity – Linear probes are a coarse diagnostic; richer probing (e.g., probing attention patterns) could yield deeper mechanistic insights.
- User‑centric impact – The paper measures comprehensiveness but does not directly assess user satisfaction or downstream task success; future work could link collapse to real‑world user metrics.
- Training‑time interventions – Exploring instruction‑tuning recipes that explicitly regularize against token‑level collapse (e.g., adversarial token bans during fine‑tuning) could pre‑empt the issue.
Authors
- Erfan Baghaei Potraghloo
- Seyedarmin Azizi
- Souvik Kundu
- Massoud Pedram
Paper Information
- arXiv ID: 2604.13006v1
- Categories: cs.CL, cs.AI
- Published: April 14, 2026