[Paper] Long Chain-of-Thought Compression via Fine-Grained Group Policy Optimization
Source: arXiv - 2602.10048v1
Overview
Large language models (LLMs) have become adept at “Chain‑of‑Thought” (CoT) prompting, where the model spells out step‑by‑step reasoning before giving an answer. While this often boosts accuracy on hard problems, the generated reasoning can be overly long, inflating inference latency and token‑based costs. The paper introduces Fine‑Grained Group Policy Optimization (FGO)—a reinforcement‑learning (RL) technique that trims CoT sequences without sacrificing performance, making LLM‑driven reasoning more production‑ready.
Key Contributions
- FGO algorithm: Extends Group Relative Policy Optimization (GRPO) with fine‑grained weighting based on token length and output entropy, enabling selective compression of CoT steps (GRPO's group‑relative baseline is sketched after this list).
- Entropy‑aware weighting: Prevents the “entropy collapse” issue in GRPO, ensuring the model retains diverse, informative reasoning paths.
- Improved data efficiency: Re‑uses intermediate group responses more effectively, reducing the amount of RL‑training data needed.
- Empirical validation: Demonstrates on math‑heavy benchmarks (MATH500, AIME24, AMC23, Minerva) that compressed CoTs achieve near‑identical accuracy while cutting token usage by up to 35 %.
- Open‑source implementation: Provides code and pretrained policy checkpoints for easy integration with existing LLM pipelines.
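For context, GRPO (the method FGO builds on) scores each sampled response against the statistics of its own group of samples for the same prompt, removing the need for a learned value network. A minimal sketch of that group‑relative normalization, following the standard GRPO convention rather than code from the paper:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: each sampled response is
    scored against the mean/std of its own group, so no learned value
    network is needed. Standard formulation, not the paper's code."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: four responses sampled for one prompt; correct answers (reward 1)
# receive positive advantage, incorrect ones (reward 0) negative.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```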
Methodology
- Group Formation – During inference, the model’s CoT is split into groups (e.g., one group per logical sub‑step).
- Fine‑grained Subdivision – Each group is further broken down into smaller fragments. The algorithm evaluates each fragment’s length (shorter fragments are cheaper) and entropy (higher entropy indicates more informative content).
- Weight Assignment – Fragments receive a weight that balances brevity and informativeness; high‑entropy, short fragments get the highest priority (see the weighting sketch after this list).
- Policy Optimization – Using RL, the policy learns to select the weighted combination of fragments that maximizes a reward combining accuracy (a correct final answer) and efficiency (a reduced token count); a toy version of such a reward is sketched below, after the weighting example.
- Training Loop – The process iterates over batches of reasoning examples, updating the policy via the Fine‑Grained Group Policy Optimization objective, a refined version of GRPO that explicitly penalizes entropy collapse and encourages better reuse of past group data.
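The exact weighting function is not given in this summary, so the following is a minimal sketch of the length/entropy trade‑off described above. The coefficients `alpha` and `beta` and the softmax normalization are assumptions for illustration, not the paper's formula:

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one token's predictive distribution;
    averaging this over a fragment yields its mean entropy."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def fragment_weights(fragments, alpha=1.0, beta=0.01):
    """Illustrative length/entropy weighting, NOT the paper's formula.
    fragments: list of (length_in_tokens, mean_token_entropy) pairs.
    alpha rewards informativeness, beta penalizes length, so short,
    high-entropy fragments receive the largest weights."""
    scores = [alpha * h - beta * n for n, h in fragments]
    m = max(scores)  # subtract the max before exponentiating, for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Example: (tokens, mean entropy) for three fragments of one response;
# the short, high-entropy middle fragment dominates.
print(fragment_weights([(120, 0.4), (30, 1.2), (60, 0.9)]))
```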
The overall pipeline can be dropped into any existing CoT‑enabled LLM service with minimal engineering overhead.
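Similarly, the reward in the Policy Optimization step is described only as combining accuracy and efficiency. One common way to encode that combination looks like the sketch below, where `lam` and `max_tokens` are illustrative placeholders rather than values from the paper:

```python
def reasoning_reward(is_correct, num_tokens, max_tokens=4096, lam=0.3):
    """Sketch of an accuracy-plus-efficiency reward. Correctness dominates;
    a length bonus scaled by lam favors shorter chains. lam and max_tokens
    are illustrative, not taken from the paper."""
    accuracy = 1.0 if is_correct else 0.0
    efficiency = 1.0 - min(num_tokens / max_tokens, 1.0)
    return accuracy + lam * efficiency

print(reasoning_reward(True, 900))    # correct and short: highest reward
print(reasoning_reward(True, 3800))   # correct but verbose: lower reward
print(reasoning_reward(False, 900))   # wrong: brevity alone cannot compensate
```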
Results & Findings
| Benchmark | Baseline CoT (tokens) | FGO‑Compressed CoT (tokens) | Accuracy Δ |
|---|---|---|---|
| MATH500 | 1.42 M | 0.93 M (−34 %) | −0.2 % |
| AIME24 | 0.78 M | 0.52 M (−33 %) | −0.1 % |
| AMC23 | 0.64 M | 0.44 M (−31 %) | 0.0 % |
| Minerva | 1.10 M | 0.71 M (−35 %) | −0.3 % |
- Token savings: Across all datasets, FGO reduces the number of generated tokens by roughly one‑third.
- Performance preservation: The drop in accuracy is negligible (≤ 0.3 %), confirming that the compressed reasoning still carries the essential logical content.
- Stability: Training curves show that FGO converges faster than GRPO and avoids the sharp entropy dip that previously caused degenerate policies.
Practical Implications
- Lower inference cost – For SaaS providers charging per token (e.g., OpenAI, Anthropic), a 30 % reduction translates directly into cheaper API usage, especially for heavy reasoning workloads like tutoring bots or automated theorem provers (see the back‑of‑the‑envelope sketch after this list).
- Reduced latency – Shorter CoTs mean fewer round‑trips in the model’s decoder, cutting response times—a win for real‑time assistants and interactive coding tools.
- Scalable reasoning services – Companies can serve more concurrent users on the same hardware budget, making LLM‑based problem‑solving viable at scale.
- Easier integration – Because FGO trains its compression policy on top of an existing base LLM, developers can retrofit current pipelines (e.g., LangChain, LlamaIndex) without retraining the entire model.
- Potential for other domains – The same fine‑grained weighting idea could compress verbose outputs in code generation, data‑to‑text, or legal document drafting, where brevity matters.
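To make the cost point concrete, here is a back‑of‑the‑envelope calculation; every number in it (price, request volume, token count) is a placeholder, not a quote from any provider:

```python
def monthly_savings(tokens_per_request, requests_per_month,
                    price_per_million_tokens, reduction=0.30):
    """Illustrative cost arithmetic for a token-metered API;
    all inputs below are placeholders, not real pricing."""
    baseline = (tokens_per_request * requests_per_month / 1e6
                * price_per_million_tokens)
    return baseline * reduction

# e.g., 2k reasoning tokens/request, 5M requests/month, $10 per 1M tokens:
print(f"${monthly_savings(2000, 5_000_000, 10.0):,.0f} saved per month")  # $30,000
```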
Limitations & Future Work
- Domain specificity – Experiments focus on mathematical reasoning; effectiveness on narrative or open‑ended tasks remains untested.
- RL overhead – While inference is cheaper, the RL fine‑tuning step adds a one‑time computational cost that may be non‑trivial for very large models.
- Heuristic weighting – The current length‑entropy trade‑off is hand‑crafted; learning a more adaptive weighting scheme could further improve compression.
- User control – Future work could expose a “compression budget” API, letting developers specify a target token count or latency constraint (a hypothetical interface is sketched below).
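No such budget API exists yet; a hypothetical shape for one, with every name invented purely for illustration, might look like this:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CompressionBudget:
    """Hypothetical developer-facing knob for the 'compression budget'
    idea above; every name here is invented for illustration."""
    max_cot_tokens: Optional[int] = None     # hard cap on reasoning tokens
    max_latency_ms: Optional[float] = None   # or an end-to-end latency target

# A caller might pass CompressionBudget(max_cot_tokens=512), and the FGO
# policy would be conditioned (or its reward re-weighted) to respect it.
print(CompressionBudget(max_cot_tokens=512))
```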
The authors suggest extending FGO to multimodal reasoning (e.g., vision‑language chains) and exploring curriculum‑style training where the policy gradually learns to compress increasingly complex CoTs.
Authors
- Xinchen Han
- Hossam Afifi
- Michel Marot
- Xilu Wang
- Lu Yin
Paper Information
- arXiv ID: 2602.10048v1
- Categories: cs.LG, cs.AI
- Published: February 10, 2026