[Paper] Verbatim Data Transcription Failures in LLM Code Generation: A State-Tracking Stress Test
Source: arXiv - 2601.03640v1
Overview
The paper “Verbatim Data Transcription Failures in LLM Code Generation: A State‑Tracking Stress Test” shines a light on a subtle but critical reliability problem: large language models (LLMs) often drop, reorder, or alter exact numeric data when they generate code. While most benchmarks focus on algorithmic correctness, this work isolates the data‑integrity aspect by asking models to copy a list of high‑precision constants into a Python script unchanged. Even a single‑digit mistake can break cryptographic protocols, calibration tables, or safety‑critical configurations, making this a practical concern for any production‑grade code‑generation pipeline.
Key Contributions
- Minimalist transcription benchmark – a compact dataset of decimal constant lists plus a tiny aggregation task that forces the model to embed the numbers verbatim (a minimal item is sketched after this list).
- Prompting variants – systematic exploration of zero‑shot, few‑shot, and chain‑of‑thought prompts to see how instruction style affects fidelity.
- Exact‑string evaluation metric – a strict inclusion check that flags any deviation (missing digit, extra whitespace, rounding) as a failure, providing a clear signal of transcription errors.
- Failure taxonomy – categorization of errors into omission, substitution, reordering, and state‑tracking lapses that emerge over longer constant sequences.
- Open‑source benchmark & analysis toolkit – released on GitHub to enable reproducible stress testing of new LLMs and code‑generation pipelines.
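To ground the first contribution, the snippet below sketches what a single benchmark item might look like: a list of constants given as exact strings plus the transcription‑and‑sum instruction. The field names and values are assumptions for illustration, not the schema of the released dataset.

```python
# Hypothetical shape of one benchmark item; field names and constant values
# are illustrative, not the released dataset's schema.
EXAMPLE_ITEM = {
    "constants": ["0.57721566490153", "1.61803398874989", "2.71828182845904"],
    "task": (
        "Write a Python function that stores the constants below in a list "
        "exactly as given and returns their sum:\n"
        "0.57721566490153, 1.61803398874989, 2.71828182845904"
    ),
}
```

The exact‑string metric then simply asks whether every string in `constants` appears verbatim, and in order, in the generated code.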
Methodology
- Dataset construction – The authors curated several real‑world‑inspired lists (e.g., cryptographic S‑box values, sensor calibration tables) containing 10–200 high‑precision decimal numbers.
- Task definition – For each list, the model receives a prompt asking it to generate a Python function that (a) stores the constants in a list variable exactly as given and (b) returns the sum of the numbers.
- Prompt designs –
  - Zero‑shot: “Write a Python function that …”
  - Few‑shot: Provide a short example with a tiny constant list.
  - Chain‑of‑thought: Ask the model to first “list the constants verbatim, then embed them in code.”
- Generation & capture – Models (GPT‑4, Claude‑2, Llama 2‑70B, and an open‑source fine‑tuned code model) are queried at temperature 0.0 to eliminate stochastic variation, using the OpenAI and Anthropic APIs for the hosted models.
- Evaluation – The generated script is parsed, the constant list is extracted, and the values are compared to the reference using an exact‑string match (including sign, decimal point, and trailing zeros); any mismatch counts as a failure. A minimal version of this check is sketched after this list.
- Error analysis – Failed cases are manually inspected and labeled according to the taxonomy above, allowing the authors to quantify state‑tracking errors that grow with list length.
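To make the task and the grading concrete, here is a minimal sketch of the pipeline: a toy reference list, the shape of function the prompt asks the model to emit, and a strict verbatim comparison of the extracted literals. The constant values, the sample generated script, and the helper names are invented for illustration; this is not the paper's released harness, which may extract and compare constants differently.

```python
import ast

# Toy reference list standing in for one benchmark item; the real lists
# hold 10-200 high-precision constants (values here are invented).
REFERENCE_CONSTANTS = [
    "3.14159265358979",
    "-0.00127400000000",
    "6.02214076000000",
]

# Shape of the code the model is asked to produce: the constants embedded
# verbatim in a list variable, plus the trivial aggregation (their sum).
GENERATED_SCRIPT = """
def constants_sum():
    constants = [3.14159265358979, -0.00127400000000, 6.02214076000000]
    return sum(constants)
"""


def extract_literal_constants(source: str) -> list[str]:
    """Return the source text of each element of the first list literal in
    the script, so signs and trailing zeros are preserved exactly."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.List):
            return [ast.get_source_segment(source, elt) for elt in node.elts]
    return []


def is_verbatim(extracted: list[str], reference: list[str]) -> bool:
    """Strict check: any omission, substitution, reordering, or rounding
    shows up as a mismatch and counts as a transcription failure."""
    return extracted == reference


print(is_verbatim(extract_literal_constants(GENERATED_SCRIPT),
                  REFERENCE_CONSTANTS))  # True only if copied exactly
```

Dropping a digit from any literal, or letting the model round 6.02214076000000 to 6.02214076, flips the check to False, which is exactly the class of silent error the benchmark is designed to surface.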
Results & Findings
| Model | Success Rate (≤ 20 constants) | Success Rate (≥ 100 constants) |
|---|---|---|
| GPT‑4 | 96 % | 71 % |
| Claude‑2 | 94 % | 68 % |
| Llama 2‑70B | 88 % | 45 % |
| Open‑source fine‑tuned (CodeLlama‑34B) | 81 % | 32 % |
- Length matters – All models exhibit a steep drop in fidelity once the constant list exceeds ~50 items, indicating a state‑tracking bottleneck.
- Prompting effect – Few‑shot prompts improve short‑list performance by ~2–3 percentage points but have negligible impact on long lists. Chain‑of‑thought prompts reduce substitution errors but increase omissions due to early truncation.
- Error patterns – The most common failure for long lists is omission (the model silently drops every 5th‑10th constant). Substitutions are rarer but often involve rounding to fewer decimal places.
- No algorithmic failures – The aggregation (sum) computation is correct whenever the constant list is reproduced, confirming that the issue is purely data transcription.
Practical Implications
- Safety‑critical code generation – When LLMs are used to scaffold cryptographic libraries, firmware configuration, or scientific data pipelines, a silent transcription error can introduce vulnerabilities or calibration drift.
- Automated CI/CD checks – Integrating the exact‑string benchmark into continuous‑integration pipelines can flag models that are not yet trustworthy for data‑heavy code generation.
- Prompt engineering guidelines – Developers should avoid relying on LLMs to copy long numeric tables verbatim; instead, consider feeding the data as an external file or using a structured prompt (e.g., JSON) that the downstream system parses (see the sketch after this list).
- Tooling opportunities – The released benchmark can serve as a regression suite for new code‑generation models, encouraging vendors to expose state‑tracking metrics (e.g., token‑level attention windows) alongside traditional accuracy scores.
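As suggested in the prompt‑engineering guideline above, one way to keep long numeric tables out of the model's generation loop is to have the generated code read them from a version‑controlled data file rather than embed them as literals. The sketch below assumes a JSON file named calibration_constants.json containing a plain array of numbers; the file name, format, and function names are illustrative and are not part of the paper's toolkit.

```python
import json
from pathlib import Path

# Hypothetical data file; because the constants never pass through the
# model's token stream, they cannot be silently dropped or rounded.
CONSTANTS_FILE = Path("calibration_constants.json")


def load_constants(path: Path = CONSTANTS_FILE) -> list[float]:
    """Read the constants from version-controlled data at runtime; the
    generated code only needs to reference the file."""
    with path.open() as fh:
        return json.load(fh)


def checksum(values: list[float]) -> float:
    """The aggregation step from the benchmark, applied to data the model
    never had to transcribe."""
    return sum(values)


if __name__ == "__main__":
    values = load_constants()
    print(f"{len(values)} constants loaded, sum = {checksum(values)!r}")
```

For code that must embed literals, the same kind of exact‑string check sketched in the Methodology section can run as a CI gate, failing the build whenever the generated constants diverge from the reference data.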
Limitations & Future Work
- Scope limited to Python – The benchmark currently targets a single language; extending to C/C++, Rust, or hardware description languages may reveal different failure modes.
- Exact‑string metric is strict – Minor formatting differences (e.g., extra spaces) are counted as failures even when the numeric values are unchanged; a more nuanced numeric‑tolerance metric could be added.
- Model diversity – Only a handful of commercial and open‑source models were evaluated; future work should include emerging multimodal or retrieval‑augmented LLMs that may handle long context better.
- Mitigation strategies – The paper proposes prompt variants but does not explore architectural changes (e.g., longer context windows, external memory) that could directly address the state‑tracking bottleneck.
By exposing how even state‑of‑the‑art LLMs stumble on the seemingly trivial task of copying numbers, this work gives developers a concrete stress test and a reminder: when precision matters, “just ask the model” isn’t enough—verify, validate, and, when possible, keep the raw data out of the language model’s generation loop.
Authors
- Mohd Ariful Haque
- Kishor Datta Gupta
- Mohammad Ashiqur Rahman
- Roy George
Paper Information
- arXiv ID: 2601.03640v1
- Categories: cs.SE, cs.CR
- Published: January 7, 2026