[Paper] Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity

Published: December 5, 2025, 01:56 PM EST
3 min read
Source: arXiv


Overview

The paper “Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity” challenges the dominant use of Reinforcement Learning (RL) for fine‑tuning large language models (LLMs) on reasoning tasks. The authors show that RL’s implicit optimization of a mode‑seeking reverse KL divergence can collapse the model’s output diversity, and they propose a filtering‑based approach that directly balances precision and coverage.

Key Contributions

  • Explicit target distribution via filtering – Incorrect answers are removed while preserving the relative probabilities of all correct solutions.
  • α‑divergence framework – Unifies mode‑seeking (reverse KL) and mass‑covering (forward KL) objectives, giving a single knob to trade off precision vs. diversity (see the formula sketch after this list).
  • Pareto‑optimal coverage‑precision trade‑off – Demonstrates state‑of‑the‑art results on a Lean theorem‑proving benchmark, especially improving coverage (the ability to generate many distinct correct proofs).
  • Theoretical insight – Links the loss of diversity in RL‑tuned LLMs to the “zero‑forcing” property of reverse KL, offering a principled alternative.
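
For reference, one common parameterization of the α‑divergence family, written here between the model distribution q_θ and the filtered target p*, reproduces the limits quoted in this summary; the paper's exact convention may differ.

```latex
% One common alpha-divergence parameterization between the model q_theta and
% the filtered target p* (an assumed convention; the paper's may differ).
\[
  D_\alpha\!\left(q_\theta \,\|\, p^*\right)
  = \frac{1}{\alpha(\alpha - 1)}
    \left( \sum_x q_\theta(x)^{\alpha}\, p^*(x)^{1-\alpha} - 1 \right),
\]
% with the limiting cases
\[
  \lim_{\alpha \to 1} D_\alpha = \mathrm{KL}\!\left(q_\theta \,\|\, p^*\right)
  \; \text{(reverse KL, mode-seeking)},
  \qquad
  \lim_{\alpha \to 0} D_\alpha = \mathrm{KL}\!\left(p^* \,\|\, q_\theta\right)
  \; \text{(forward KL, mass-covering)}.
\]
```

Intermediate values of α interpolate between the two behaviors, which is exactly the precision/coverage knob described above.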

Methodology

  1. Collect a candidate pool – Generate a large set of possible answers (e.g., proof steps) from a pre‑trained LLM.
  2. Filter out wrong answers – Use an external verifier (a theorem prover or a classifier) to keep only the correct candidates. The remaining set defines the target distribution: each correct answer retains its original relative likelihood.
  3. Approximate the target with α‑divergence – Train the LLM to minimize an α‑divergence between its current output distribution and the filtered target.
    • When α → 0, the objective behaves like forward KL (mass‑covering → high diversity).
    • When α → 1, it behaves like reverse KL (mode‑seeking → high precision).
    • Intermediate α values let practitioners dial the desired balance.
  4. Optimization – The authors use a simple stochastic gradient descent loop: sample from the model, re‑weight by the α‑divergence gradient, and update the model parameters. No RL reward‑shaping or policy‑gradient tricks are required (a toy end‑to‑end sketch follows this list).
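
To make the pipeline concrete, here is a minimal, self‑contained sketch of steps 1–4 on a toy problem. It is not the authors' implementation: a categorical distribution over five hand‑picked answers stands in for an LLM over proof sequences, a rule‑based check stands in for the Lean verifier, and because the toy distributions are explicit, the α‑divergence is computed in closed form rather than estimated from samples as a real training loop would require. The answer set, `alpha` value, and optimizer settings are illustrative only.

```python
# Minimal sketch (NOT the authors' implementation) of filtering + alpha-divergence
# fitting on a toy discrete problem. The parameterization matches the limits quoted
# in this summary (alpha -> 0: forward KL, alpha -> 1: reverse KL).
import torch

# --- Step 1: candidate pool (here: the whole finite answer space) -----------
answers = ["proof_a", "proof_b", "proof_c", "wrong_1", "wrong_2"]

# Reference (pre-trained) distribution over answers.
ref_probs = torch.tensor([0.30, 0.15, 0.05, 0.35, 0.15])

# --- Step 2: filter with a verifier and renormalize -------------------------
def verifier(ans: str) -> bool:
    """Stand-in for an external checker (e.g. a theorem prover)."""
    return not ans.startswith("wrong")

correct = torch.tensor([verifier(a) for a in answers], dtype=torch.float32)
target = ref_probs * correct        # zero out incorrect answers
target = target / target.sum()      # correct answers keep their relative mass

# --- Step 3: alpha-divergence between model q_theta and filtered target -----
def alpha_divergence(q: torch.Tensor, p: torch.Tensor, alpha: float) -> torch.Tensor:
    """D_alpha(q || p) = (sum_x q^alpha * p^(1-alpha) - 1) / (alpha * (alpha - 1))."""
    eps = 1e-12
    stat = (q.clamp_min(eps) ** alpha * p.clamp_min(eps) ** (1.0 - alpha)).sum()
    return (stat - 1.0) / (alpha * (alpha - 1.0))

# --- Step 4: plain gradient descent, no RL machinery ------------------------
alpha = 0.5                                               # precision/coverage knob
logits = torch.zeros(len(answers), requires_grad=True)    # trainable toy "model"
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(500):
    q = torch.softmax(logits, dim=0)
    loss = alpha_divergence(q, target, alpha)
    opt.zero_grad()
    loss.backward()
    opt.step()

print("filtered target:", target.tolist())
print("fitted model:   ", torch.softmax(logits, dim=0).tolist())
```

In this toy setting any α recovers the target exactly, because the model can represent it; the precision/coverage trade‑off matters when the model class cannot, where α near 1 concentrates mass on the dominant correct modes and α near 0 spreads it across all of them.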

Results & Findings

  • Coverage boost: On the Lean theorem‑proving suite, the proposed method generates up to 30 % more distinct correct proofs than the best RL‑based baselines.
  • Precision maintained: Even with higher coverage, the correctness rate stays comparable to RL methods, confirming the controllable trade‑off.
  • Pareto frontier: By varying α, the authors trace a smooth curve that dominates previous approaches—no other method simultaneously achieves higher coverage for a given precision level.
  • Ablation: Removing the filtering step collapses performance back to RL‑like behavior, underscoring the importance of an explicit target distribution.

Practical Implications

  • More robust code‑generation tools – Developers can obtain a richer set of valid code snippets or query rewrites, reducing the need for repeated prompting.
  • Automated theorem proving & formal verification – Higher coverage means a verifier can explore more proof strategies in parallel, speeding up verification pipelines.
  • Chatbot answer diversity – Customer‑support bots can present multiple correct solutions (e.g., troubleshooting steps) without sacrificing accuracy, improving user experience.
  • Simplified fine‑tuning pipelines – The α‑divergence approach avoids the engineering overhead of RL (reward design, policy‑gradient variance reduction), making it easier to integrate into existing MLOps workflows.

Limitations & Future Work

  • Dependence on a reliable filter – The method assumes an external verifier that can accurately label correct vs. incorrect outputs; noisy filters could degrade the target distribution.
  • Scalability of candidate generation – Generating a large pool of candidates for filtering can be computationally expensive for very large vocabularies or long‑form tasks.
  • Benchmark scope – Experiments focus on a Lean theorem‑proving benchmark; broader evaluation on code synthesis, math problem solving, or natural‑language reasoning remains to be done.
  • Future directions – The authors suggest exploring adaptive α schedules, integrating learned filters (e.g., self‑consistent models), and applying the framework to multi‑modal reasoning tasks.

Authors

  • Germán Kruszewski
  • Pierre Erbacher
  • Jos Rozen
  • Marc Dymetman

Paper Information

  • arXiv ID: 2512.05962v1
  • Categories: cs.LG, cs.AI
  • Published: December 5, 2025