[Paper] Understanding Chain-of-Thought Effectiveness in Code Generation: An Empirical and Information-Theoretic Analysis

Published: December 10, 2025 at 09:25 AM EST
4 min read

Source: arXiv - 2512.09679v1

Overview

The paper investigates why “Chain‑of‑Thought” (CoT) prompting—where a model is asked to reason step‑by‑step before emitting code—boosts the performance of large language models (LLMs) on code‑generation tasks. By combining large‑scale experiments with an information‑theoretic lens, the authors pinpoint when and how different CoT strategies help developers get correct, runnable code from LLMs.

Key Contributions

  • Systematic comparison of five CoT paradigms (Zero‑Shot, Zero‑Shot CoT, Self‑Planning, Structured CoT, Reasoning‑CoT) across multiple benchmarks and programming languages.
  • Introduction of conditional mutual information, I(Y; C | X), as a quantitative way to reason about how much the intermediate “thought” (C) contributes to the final code (Y) given the problem description (X); the standard definition is recalled right after this list.
  • Empirical evidence that externally guided CoT (e.g., Structured CoT) consistently outperforms direct generation, improving Pass@1 by 5–12 % while using far fewer tokens than reflective (self‑generated) reasoning.
  • Analysis of the interaction between model size, language type system, and CoT effectiveness, showing that larger models and statically‑typed languages benefit more from high‑quality CoT.
  • Practical guidelines for selecting the right CoT strategy based on model capacity, target language, and task complexity.
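
For readers who want the quantity pinned down, the conditional mutual information referenced above follows the standard information‑theoretic identity below (a textbook definition, restated here rather than quoted from the paper):

```latex
% Conditional mutual information between the final code Y and the
% intermediate reasoning C, given the problem description X.
I(Y; C \mid X) = H(Y \mid X) - H(Y \mid X, C)
               = \mathbb{E}_{x, c, y}\!\left[ \log \frac{p(y \mid x, c)}{p(y \mid x)} \right]
```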

Methodology

  1. Benchmarks – Six Python coding suites (e.g., HumanEval, MBPP), a multilingual suite covering 12 languages (Java, JavaScript, Rust, etc.), and a set of synthetic reasoning tasks.
  2. Models – Six open‑source LLMs ranging from 7 B to 480 B parameters, ensuring the findings span both “small” and “large” models.
  3. Prompting Paradigms (illustrative templates for each paradigm are sketched after this list)
    • Zero‑Shot: Directly ask for code.
    • Zero‑Shot CoT: Add a generic “think step‑by‑step” cue.
    • Self‑Planning: Let the model first generate a plan, then code.
    • Structured CoT: Provide a fixed template (e.g., “1️⃣ Define inputs → 2️⃣ Outline algorithm → 3️⃣ Write code”).
    • Reasoning‑CoT: Allow the model to freely reason before coding.
  4. Metrics – Pass@k (primarily Pass@1) for functional correctness, token count for efficiency, and conditional mutual information I(Y; C | X) to quantify how much the intermediate reasoning (C) reduces uncertainty about the final code (Y); a reference Pass@k implementation is sketched after this list.
  5. Analysis – Correlate I(Y; C | X) with observed performance gains, and break down results by model size, language typing discipline (static vs. dynamic), and task difficulty.
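
To make the paradigm differences concrete, here is a minimal sketch of one plausible prompt template per paradigm; the wording is an assumption for illustration and is not taken from the paper’s actual prompts.

```python
# Illustrative prompt templates for the five paradigms compared in the paper.
# The exact wording is an assumption; only the overall shape of each paradigm
# (direct vs. cued vs. planned vs. templated vs. free-form reasoning) matters.
PROMPT_TEMPLATES = {
    "zero_shot": (
        "Write a {language} function that solves the following task.\n"
        "Task: {task}\n"
    ),
    "zero_shot_cot": (
        "Write a {language} function that solves the following task.\n"
        "Task: {task}\n"
        "Let's think step by step before writing the code.\n"
    ),
    "self_planning": (
        "Task: {task}\n"
        "First, write a short plan as a numbered list of steps.\n"
        "Then, implement the plan in {language}.\n"
    ),
    "structured_cot": (
        "Task: {task}\n"
        "1. Define the inputs and outputs.\n"
        "2. Outline the algorithm.\n"
        "3. Write the {language} code.\n"
    ),
    "reasoning_cot": (
        "Task: {task}\n"
        "Reason freely about the approach and edge cases, "
        "then produce the final {language} code.\n"
    ),
}

def build_prompt(paradigm: str, task: str, language: str = "Python") -> str:
    """Fill in a paradigm template for a given task description."""
    return PROMPT_TEMPLATES[paradigm].format(task=task, language=language)
```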
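
Pass@k is usually computed with the unbiased estimator popularised by the HumanEval paper (Chen et al., 2021). The paper summarised here does not reprint the formula, but a standard implementation, assuming n samples per problem of which c pass the tests, looks like this:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: number of samples that pass all unit tests
    k: sample budget being evaluated
    """
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 200 samples drawn, 37 pass the tests.
print(pass_at_k(n=200, c=37, k=1))   # 0.185
print(pass_at_k(n=200, c=37, k=10))  # probability at least one of 10 passes
```

If Pass@1 is instead computed from a single greedy completion per problem, the estimator degenerates to plain accuracy (n = 1, k = 1).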

Results & Findings

| Paradigm | Avg. Pass@1 change vs. Zero‑Shot | Token Overhead | Key Insight |
|---|---|---|---|
| Zero‑Shot CoT | –1 % to –3 % (sometimes hurts) | +15 % | Naïve “think step‑by‑step” can introduce noise. |
| Self‑Planning | +3 % to +6 % | +30 % | Planning helps but costs more tokens. |
| Structured CoT | +5 % to +12 % | +10 % | Fixed template yields high‑quality reasoning with modest token cost. |
| Reasoning‑CoT | +4 % to +8 % | +45 % | Freeform reasoning improves accuracy but is inefficient. |

  • Conditional Mutual Information: Higher I(Y; C | X) correlates strongly (r ≈ 0.78) with Pass@1 gains, confirming that useful intermediate thoughts reduce uncertainty about the final code; a sketch of how such an estimate can be computed from model log‑probabilities follows this list.
  • Model Capacity: Models ≥ 70 B parameters show the biggest boost from Structured CoT; smaller models (< 13 B) gain little or even regress.
  • Language Type System: Statically‑typed languages (Java, Rust) benefit more (≈ 10 % boost) than dynamically‑typed ones (Python, JavaScript), likely because explicit reasoning clarifies type constraints.
  • Quality vs. Quantity: High‑quality CoT generated by a strong “reasoner” (e.g., 480 B model) outperforms a larger number of low‑quality CoT steps from a weaker model, emphasizing reasoning quality over sheer token count.
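
One way to connect this correlation to something measurable: writing I(Y; C | X) as the expected log‑likelihood ratio E[log p(y | x, c) − log p(y | x)] suggests a simple plug‑in estimate from model log‑probabilities. The sketch below assumes a hypothetical score_code helper that returns a scoring model’s log‑probability of the code under a given prompt; it illustrates the general idea and is not the paper’s estimator.

```python
from statistics import mean

def estimate_cmi(examples, score_code):
    """Rough plug-in estimate of I(Y; C | X) as the average of
    log p(y | x, c) - log p(y | x) over sampled (x, c, y) triples.

    examples:   iterable of (problem x, reasoning c, final code y) triples
                drawn from the generation process being analysed.
    score_code: hypothetical helper, score_code(code, prompt) -> float,
                giving a scoring model's total log-probability of `code`
                conditioned on `prompt`.
    """
    ratios = []
    for x, c, y in examples:
        logp_with_cot = score_code(y, prompt=x + "\n" + c)  # condition on X and C
        logp_without = score_code(y, prompt=x)              # condition on X only
        ratios.append(logp_with_cot - logp_without)
    return mean(ratios)  # in nats; larger values = more informative reasoning
```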

Practical Implications

  1. Prompt Engineers – When using LLMs ≥ 70 B for code generation, adopt a Structured CoT template (e.g., “Input → Algorithm → Code”) instead of a generic “think step‑by‑step” cue.
  2. Tooling Vendors – Embed CoT templates into IDE assistants (e.g., GitHub Copilot, Tabnine) to automatically surface a short reasoning block before the generated snippet, improving correctness without a noticeable latency penalty.
  3. Multilingual Codebases – For projects that mix static and dynamic languages, prioritize Structured CoT for the static parts (type‑heavy modules) and fall back to direct generation for quick scripts.
  4. Resource‑Constrained Environments – Small models (< 13 B) may skip CoT altogether or use a very lightweight plan (a one‑line comment) to avoid performance degradation; these rules of thumb are condensed into the strategy selector sketched after this list.
  5. Evaluation Pipelines – Incorporate conditional mutual information estimates as a diagnostic metric to detect when a model’s reasoning is actually informative versus when it’s just filler.
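
The guidelines above condense into a small strategy selector. The 13 B and 70 B thresholds and the static‑vs‑dynamic split come from the findings summarised here; the exact decision boundaries in the sketch are a judgment call, not rules stated by the paper.

```python
def choose_cot_strategy(model_params_b: float, statically_typed: bool) -> str:
    """Rule-of-thumb CoT strategy selector distilled from the guidelines above.

    model_params_b:   model size in billions of parameters.
    statically_typed: whether the target language has a static type system.
    The thresholds mirror the summary (13 B, 70 B); treat them as heuristics.
    """
    if model_params_b < 13:
        # Small models gain little from CoT and may regress: generate directly,
        # or at most prepend a one-line plan as a comment.
        return "zero_shot"
    if model_params_b >= 70:
        # Large models benefit most from a fixed Structured CoT template,
        # especially when targeting statically typed languages.
        return "structured_cot"
    # Mid-sized models: a judgment call - prefer the template for type-heavy
    # targets, otherwise a lightweight self-generated plan.
    return "structured_cot" if statically_typed else "self_planning"


# Example usage:
print(choose_cot_strategy(70, statically_typed=True))   # structured_cot
print(choose_cot_strategy(7, statically_typed=False))   # zero_shot
```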

Limitations & Future Work

  • Scope of Languages – Although 12 languages were tested, the set leans toward mainstream, high‑level languages; low‑level or domain‑specific languages (e.g., SQL, Verilog) remain unexplored.
  • Prompt Templates – The Structured CoT template is hand‑crafted; automated discovery of optimal templates could yield further gains.
  • Model Diversity – Only open‑source LLMs were evaluated; proprietary models (e.g., GPT‑4) may exhibit different CoT dynamics.
  • Reasoning Quality Metric – The study uses functional correctness as a proxy; richer metrics (e.g., readability, security) are left for future research.

Bottom line: For developers leveraging large LLMs to write code, a modest, well‑structured chain‑of‑thought prompt can translate into noticeably more reliable code without a heavy token cost—provided the model is big enough and the language benefits from explicit reasoning.

Authors

  • Naizhu Jin
  • Zhong Li
  • Guang Yang
  • Tian Zhang
  • Qingkai Zeng

Paper Information

  • arXiv ID: 2512.09679v1
  • Categories: cs.SE
  • Published: December 10, 2025