[Paper] How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Published: April 28, 2026 at 01:52 PM EDT
5 min read

Source: arXiv - 2604.25907v1

Overview

This paper tackles a common pain point when fine‑tuning large reasoning models with only output‑level feedback (e.g., “did the answer match the ground truth?”). When the model’s initial success probability is tiny, standard reinforcement learning from verifiable rewards (RLVR) can get stuck on a “cold‑start” plateau for an impractically long time. The authors propose a continuum of loss functions based on the Tsallis $q$‑logarithm that smoothly interpolates between pure RLVR and classic maximum‑likelihood training, dramatically speeding up the escape from cold start while keeping training stable.

Key Contributions

  • Tsallis loss continuum: Introduces a family $J_Q$ parameterized by $q\in[0,1]$ that bridges RL‑style exploitation ($q=0$) and density estimation ($q=1$).
  • Gradient‑amplification insight: Shows that every loss in the family shares the same gradient direction; the only difference is a per‑example scalar amplification $P_{\theta}^{-q}$ that re‑weights updates.
  • Theoretical escape‑time analysis: Proves that pure RLVR needs $\Omega(1/p_0)$ time to leave a cold start (where $p_0$ is the initial success rate), while the likelihood pole needs only $\Theta(\log(1/p_0))$, and intermediate $q$ values trade off speed vs. noise memorization (a worked example follows this list).
  • Two practical estimators:
    1. Gradient‑Amplified RL (GARL) – samples from the prior, computes the RL gradient, then amplifies it by $P_{\theta}^{-q}$.
    2. Posterior‑Attenuated Fine‑Tuning (PAFT) – importance‑resamples from the posterior and runs a standard supervised fine‑tuning step.
  • Empirical validation: On three multi‑hop QA benchmarks (FinQA, HotPotQA, MuSiQue) GARL with $q=0.75$ eliminates cold‑start stalls where prior methods (e.g., GRPO) fail, and PAFT provides stable training on the more challenging datasets.
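
To make the escape‑time gap concrete, plug in an illustrative number (ours, not the paper’s): with an initial success rate of $p_0 = 10^{-4}$,

$$\frac{1}{p_0} = 10^{4} \qquad\text{vs.}\qquad \log\frac{1}{p_0} = \ln 10^{4} \approx 9.2,$$

so, up to constants, pure RLVR needs on the order of a thousand times more updates than the likelihood pole to leave the cold‑start plateau.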

Methodology

  1. Problem setting – The model generates latent reasoning trajectories (chains of intermediate steps). Only the final answer can be verified, so the training signal is sparse.
  2. Tsallis‑based loss – The authors replace the usual log‑likelihood $\log p_\theta$ with the Tsallis $q$‑logarithm
    $$\log_q(p_\theta)=\frac{p_\theta^{1-q}-1}{1-q},$$
    yielding a loss
    $$J_Q(\theta)=\mathbb{E}_{\text{data}}\big[-\log_q p_\theta(\text{trajectory})\big].$$
    When $q=0$ this collapses to the RLVR objective (reward‑weighted log‑probability); when $q=1$ it becomes the standard marginal likelihood.
  3. Gradient decomposition – The gradient of $J_Q$ can be written as the RL gradient multiplied by a scalar $P_{\theta}^{-q}$, where $P_{\theta}$ is the (intractable) marginal probability of the observed answer.
  4. Monte‑Carlo estimators – Because $P_{\theta}$ cannot be computed exactly, the authors derive two nearly unbiased estimators, with bias $O\big(\frac{q}{M P_{\theta}^{q+1}}\big)$ in the number of samples $M$ (both are sketched in code after this list):
    • GARL draws samples from the model’s prior distribution, evaluates the reward, and scales the RL gradient by an importance weight approximating $P_{\theta}^{-q}$.
    • PAFT draws samples from an approximate posterior (using the reward as a filter), then performs ordinary supervised fine‑tuning on those samples, effectively attenuating the gradient by $P_{\theta}^{q}$.
  5. Training loop – Both estimators plug into a standard stochastic gradient descent pipeline; the only hyper‑parameter that changes the behavior is $q$.
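
The sketch below (PyTorch‑style Python) shows how these pieces fit together: the $q$‑logarithm that defines the loss family, a GARL‑style step that amplifies the RLVR gradient by an estimate of $P_{\theta}^{-q}$, and a PAFT‑style step that filters samples by the verifier and applies a plain SFT loss. It is a minimal illustration, not the authors’ code: `model.sample`, `verifier`, and the `clamp_min` guard against a zero estimate are assumptions.

```python
import torch

def tsallis_log(p: torch.Tensor, q: float) -> torch.Tensor:
    """Tsallis q-logarithm: (p^(1-q) - 1) / (1 - q); recovers log(p) as q -> 1."""
    if abs(q - 1.0) < 1e-8:
        return torch.log(p)
    return (p.pow(1.0 - q) - 1.0) / (1.0 - q)

def garl_step(model, prompt, verifier, q: float = 0.75, M: int = 16):
    """GARL-style update: the usual RLVR policy gradient, rescaled by a
    Monte-Carlo estimate of the amplification factor P_theta^{-q}."""
    logps, rewards = [], []
    for _ in range(M):
        traj, logp = model.sample(prompt)        # hypothetical sampling API
        rewards.append(float(verifier(traj)))    # 1.0 iff the final answer verifies
        logps.append(logp)
    rewards = torch.tensor(rewards)
    p_hat = rewards.mean().clamp_min(1.0 / M)    # crude estimate of P_theta (guarded)
    # Standard RLVR surrogate: reward-weighted log-probability of sampled trajectories.
    rl_loss = -(rewards * torch.stack(logps)).mean()
    # The whole q-continuum only rescales this gradient by P_theta^{-q}.
    (p_hat.pow(-q) * rl_loss).backward()

def paft_step(model, prompt, verifier, M: int = 16):
    """PAFT-style update: rejection-sample from an approximate posterior
    (keep only verified trajectories), then take a plain SFT step on them."""
    kept = []
    for _ in range(M):
        traj, logp = model.sample(prompt)
        if verifier(traj):                       # reward acts as a posterior filter
            kept.append(logp)
    if kept:                                     # skip if nothing verified this round
        (-torch.stack(kept).mean()).backward()   # ordinary supervised fine-tuning loss
```

The `clamp_min` guard matters because the amplification $P_{\theta}^{-q}$ blows up as $P_{\theta}\to0$, which is the same regime where the paper’s bias term $O\big(q/(M P_{\theta}^{q+1})\big)$ grows.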

Results & Findings

| Dataset | Metric (majority@16) | Baseline (GRPO) | GARL $q=0.75$ | PAFT $q=0.75$ |
|---|---|---|---|---|
| FinQA | 62.1 | 58.3 | 66.5 (best) | 65.2 |
| HotPotQA | 33.5 | 30.1 | 35.2 (unstable) | 47.9 (+14.4 over GRPO) |
| MuSiQue | 28.7 | 24.9 | 31.0 (high variance) | 34.5 |
  • Cold‑start rescue: On tasks where the initial success probability $p_0$ was < 1 %, GARL with $q=0.75$ escaped the plateau in a few thousand steps, whereas GRPO never left it within the same budget.
  • Stability trade‑off: Lower $q$ values (closer to pure RL) gave faster early gains but later suffered from noisy gradient spikes; PAFT’s importance‑resampling smoothed these spikes, yielding more reliable convergence on the harder HotPotQA and MuSiQue benchmarks.
  • Bias‑variance: Empirically, GARL exhibited lower gradient variance but a small bias that vanished as training progressed; PAFT had higher variance but produced semantically coherent updates (useful for debugging).

Practical Implications

  • Faster fine‑tuning of reasoning LLMs – Developers can now adapt large language models to new multi‑step reasoning tasks (e.g., financial QA, scientific literature synthesis) with far fewer reward‑signal interactions, cutting compute costs dramatically.
  • Cold‑start mitigation – When deploying a model in a new domain where correct answers are rare, setting $q\approx0.7$ and using GARL can prevent the model from getting stuck, making iterative product roll‑outs feasible.
  • Plug‑and‑play loss – The Tsallis loss is a drop‑in replacement for the usual RL‑from‑human‑feedback (RLHF) loss; only the scalar $q$ and the estimator (GARL vs. PAFT) need to be configured.
  • Better debugging – PAFT’s “posterior‑attenuated” gradients stay close to supervised fine‑tuning updates, making it easier to trace why a model is improving or failing on a particular example.
  • Potential for hybrid pipelines – Teams can start with GARL for rapid early progress, then switch to PAFT for stable fine‑tuning once the model reaches a reasonable success rate (a rough sketch follows below).
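
A minimal sketch of such a pipeline, reusing the hypothetical `garl_step` and `paft_step` helpers from the Methodology section; the `switch_at` threshold and the per‑prompt success‑rate probe are assumptions, not values from the paper:

```python
def hybrid_finetune(model, optimizer, prompts, verifier,
                    q: float = 0.75, switch_at: float = 0.2, M: int = 16):
    """Illustrative hybrid schedule: GARL while the empirical success rate is
    low (cold start), handing off to PAFT once the model clears the plateau."""
    for prompt in prompts:
        # Cheap Monte-Carlo probe of the current success rate on this prompt.
        p_hat = sum(float(verifier(model.sample(prompt)[0])) for _ in range(M)) / M
        if p_hat < switch_at:
            garl_step(model, prompt, verifier, q=q, M=M)  # fast early progress
        else:
            paft_step(model, prompt, verifier, M=M)       # stable later updates
        optimizer.step()
        optimizer.zero_grad()
```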

Limitations & Future Work

  • Intractable marginal $P_{\theta}$ – Both estimators rely on Monte‑Carlo approximations; the bias term $O\big(\frac{q}{M P_{\theta}^{q+1}}\big)$ can become non‑negligible when $P_{\theta}$ is extremely small or the sample size $M$ is limited.
  • Stability on very noisy rewards – GARL can still diverge on datasets with highly stochastic verification signals; the paper notes occasional “gradient explosions” for $q<0.5$.
  • Scalability to massive models – Experiments were run on models up to 13 B parameters; it remains open how the approach behaves on 70 B‑scale LLMs where sampling costs dominate.
  • Automatic $q$ selection – The current work treats $q$ as a manually tuned hyper‑parameter. Future research could develop adaptive schedules that anneal $q$ based on observed $p_0$ or gradient variance (an illustrative sketch follows this list).
  • Broader task families – The study focused on multi‑hop QA; applying the Tsallis continuum to other reasoning‑heavy tasks (code generation, theorem proving) is an exciting next step.
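
As a purely illustrative sketch of what such an adaptive schedule might look like (nothing below comes from the paper; the linear ramp, its endpoints, and keying on the step count rather than on $p_0$ are all assumptions):

```python
def q_schedule(step: int, total_steps: int,
               q_start: float = 0.75, q_end: float = 1.0) -> float:
    """Hypothetical linear anneal of the Tsallis parameter q over training.
    Whether to anneal toward or away from the likelihood pole, and whether to
    key the schedule on p_0 or gradient variance instead of the step count,
    is exactly the open question the authors flag."""
    frac = min(step / max(total_steps, 1), 1.0)  # training progress in [0, 1]
    return q_start + (q_end - q_start) * frac
```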

Authors

  • Chu‑Cheng Lin
  • Eugene Ie

Paper Information

  • arXiv ID: 2604.25907v1
  • Categories: cs.LG, cs.AI
  • Published: April 28, 2026
  • PDF: Download PDF