[Paper] Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity

Published: December 5, 2025, 01:56 PM EST
3 min read
Source: arXiv


Overview

The paper “Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity” challenges the dominant use of Reinforcement Learning (RL) for fine‑tuning large language models (LLMs) on reasoning tasks. The authors show that RL’s implicit optimization of a mode‑seeking reverse KL divergence can collapse the model’s output diversity, and they propose a filtering‑based approach that directly balances precision and coverage.

Key Contributions

  • Explicit target distribution via filtering – Incorrect answers are removed while preserving the relative probabilities of all correct solutions.
  • α‑divergence framework – Unifies mode‑seeking (reverse KL) and mass‑covering (forward KL) objectives, giving a single knob to trade off precision vs. diversity (see the formula sketch after this list).
  • Pareto‑optimal coverage‑precision trade‑off – Demonstrates state‑of‑the‑art results on a Lean theorem‑proving benchmark, especially improving coverage (the ability to generate many distinct correct proofs).
  • Theoretical insight – Links the loss of diversity in RL‑tuned LLMs to the “zero‑forcing” property of reverse KL, offering a principled alternative.
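
For reference, one common parameterization of the α‑divergence family, written here between the model distribution q_θ and the filtered target p*, reproduces the limits quoted in this summary; the paper's exact convention may differ.

```latex
% One common alpha-divergence parameterization between the model q_theta and
% the filtered target p* (an assumed convention; the paper's may differ).
\[
  D_\alpha\!\left(q_\theta \,\|\, p^*\right)
  = \frac{1}{\alpha(\alpha - 1)}
    \left( \sum_x q_\theta(x)^{\alpha}\, p^*(x)^{1-\alpha} - 1 \right),
\]
% with the limiting cases
\[
  \lim_{\alpha \to 1} D_\alpha = \mathrm{KL}\!\left(q_\theta \,\|\, p^*\right)
  \; \text{(reverse KL, mode-seeking)},
  \qquad
  \lim_{\alpha \to 0} D_\alpha = \mathrm{KL}\!\left(p^* \,\|\, q_\theta\right)
  \; \text{(forward KL, mass-covering)}.
\]
```

Intermediate values of α interpolate between the two behaviors, which is exactly the precision/coverage knob described above.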

Methodology

  1. Collect a candidate pool – Generate a large set of possible answers (e.g., proof steps) from a pre‑trained LLM.
  2. Filter out wrong answers – Use an external verifier (a theorem prover or a classifier) to keep only the correct candidates. The remaining set defines the target distribution: each correct answer retains its original relative likelihood.
  3. Approximate the target with α‑divergence – Train the LLM to minimize an α‑divergence between its current output distribution and the filtered target.
    • When α → 0, the objective behaves like forward KL (mass‑covering → high diversity).
    • When α → 1, it behaves like reverse KL (mode‑seeking → high precision).
    • Intermediate α values let practitioners dial the desired balance.
  4. Optimization – The authors use a simple stochastic gradient descent loop: sample from the model, re‑weight by the α‑divergence gradient, and update the model parameters. No RL reward‑shaping or policy‑gradient tricks are required (a toy end‑to‑end sketch follows this list).
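
To make the pipeline concrete, here is a minimal, self‑contained sketch of steps 1–4 on a toy problem. It is not the authors' implementation: a categorical distribution over five hand‑picked answers stands in for an LLM over proof sequences, a rule‑based check stands in for the Lean verifier, and because the toy distributions are explicit, the α‑divergence is computed in closed form rather than estimated from samples as a real training loop would require. The answer set, `alpha` value, and optimizer settings are illustrative only.

```python
# Minimal sketch (NOT the authors' implementation) of filtering + alpha-divergence
# fitting on a toy discrete problem. The parameterization matches the limits quoted
# in this summary (alpha -> 0: forward KL, alpha -> 1: reverse KL).
import torch

# --- Step 1: candidate pool (here: the whole finite answer space) -----------
answers = ["proof_a", "proof_b", "proof_c", "wrong_1", "wrong_2"]

# Reference (pre-trained) distribution over answers.
ref_probs = torch.tensor([0.30, 0.15, 0.05, 0.35, 0.15])

# --- Step 2: filter with a verifier and renormalize -------------------------
def verifier(ans: str) -> bool:
    """Stand-in for an external checker (e.g. a theorem prover)."""
    return not ans.startswith("wrong")

correct = torch.tensor([verifier(a) for a in answers], dtype=torch.float32)
target = ref_probs * correct        # zero out incorrect answers
target = target / target.sum()      # correct answers keep their relative mass

# --- Step 3: alpha-divergence between model q_theta and filtered target -----
def alpha_divergence(q: torch.Tensor, p: torch.Tensor, alpha: float) -> torch.Tensor:
    """D_alpha(q || p) = (sum_x q^alpha * p^(1-alpha) - 1) / (alpha * (alpha - 1))."""
    eps = 1e-12
    stat = (q.clamp_min(eps) ** alpha * p.clamp_min(eps) ** (1.0 - alpha)).sum()
    return (stat - 1.0) / (alpha * (alpha - 1.0))

# --- Step 4: plain gradient descent, no RL machinery ------------------------
alpha = 0.5                                               # precision/coverage knob
logits = torch.zeros(len(answers), requires_grad=True)    # trainable toy "model"
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(500):
    q = torch.softmax(logits, dim=0)
    loss = alpha_divergence(q, target, alpha)
    opt.zero_grad()
    loss.backward()
    opt.step()

print("filtered target:", target.tolist())
print("fitted model:   ", torch.softmax(logits, dim=0).tolist())
```

In this toy setting any α recovers the target exactly, because the model can represent it; the precision/coverage trade‑off matters when the model class cannot, where α near 1 concentrates mass on the dominant correct modes and α near 0 spreads it across all of them.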

Results & Findings

  • Coverage boost: On the Lean theorem‑proving suite, the proposed method generates up to 30 % more distinct correct proofs than the best RL‑based baselines.
  • Precision maintained: Even with higher coverage, the correctness rate stays comparable to RL methods, confirming the controllable trade‑off.
  • Pareto frontier: By varying α, the authors trace a smooth curve that dominates previous approaches—no other method simultaneously achieves higher coverage for a given precision level.
  • Ablation: Removing the filtering step collapses performance back to RL‑like behavior, underscoring the importance of an explicit target distribution.

Practical Implications

  • More robust code‑generation tools – Developers can obtain a richer set of valid code snippets or query rewrites, reducing the need for repeated prompting.
  • Automated theorem proving & formal verification – Higher coverage means a verifier can explore more proof strategies in parallel, speeding up verification pipelines.
  • Chatbot answer diversity – Customer‑support bots can present multiple correct solutions (e.g., troubleshooting steps) without sacrificing accuracy, improving user experience.
  • Simplified fine‑tuning pipelines – The α‑divergence approach avoids the engineering overhead of RL (reward design, policy‑gradient variance reduction), making it easier to integrate into existing MLOps workflows.

Limitations & Future Work

  • Dependence on a reliable filter – The method assumes an external verifier that can accurately label correct vs. incorrect outputs; noisy filters could degrade the target distribution.
  • Scalability of candidate generation – Generating a large pool of candidates for filtering can be computationally expensive for very large vocabularies or long‑form tasks.
  • Benchmark scope – Experiments focus on a Lean theorem‑proving benchmark; broader evaluation on code synthesis, math problem solving, or natural‑language reasoning remains to be done.
  • Future directions – The authors suggest exploring adaptive α schedules, integrating learned filters (e.g., self‑consistent models), and applying the framework to multi‑modal reasoning tasks.

Authors

  • Germán Kruszewski
  • Pierre Erbacher
  • Jos Rozen
  • Marc Dymetman

Paper Information

  • arXiv ID: 2512.05962v1
  • Categories: cs.LG, cs.AI
  • Published: December 5, 2025