[Paper] Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training

Published: February 12, 2026 at 12:59 PM EST
Source: arXiv - 2602.12222v1

Overview

The paper introduces Distribution Discriminant Theory (DDT), a new lens for understanding why supervised fine‑tuning (SFT) of large language models (LLMs) often falls short of the generalization achieved by reinforcement‑learning‑based methods. By quantifying how closely the training data matches the model’s own output distribution, the authors devise two practical tricks—In‑Distribution Fine‑tuning (IDFT) and Hinted Decoding—that let SFT behave like an on‑policy RL algorithm while keeping its computational simplicity.

Key Contributions

  • Distribution Discriminant Theory (DDT): A formal framework that measures the “distributional gap” between the fine‑tuning corpus and the model‑induced distribution, explaining the generalization gap between SFT and RL.
  • In‑Distribution Fine‑tuning (IDFT): A loss‑level modification that re‑weights or reshapes the training objective to prioritize examples that are more representative of the model’s own output distribution.
  • Hinted Decoding: A data‑level technique applied at decoding time that feeds hints derived from the model’s own distribution back into the input prompt, nudging the model toward on‑policy behavior during generation.
  • Empirical parity with offline RL: Experiments on standard LLM benchmarks show that the combined IDFT + Hinted Decoding pipeline matches or exceeds the performance of state‑of‑the‑art offline RL methods such as DPO and SimPO, while retaining the speed and resource efficiency of pure SFT.
  • Open‑source implementation: The authors release a full codebase, making it easy for practitioners to reproduce and integrate the methods into existing fine‑tuning pipelines.

Methodology

  1. Quantifying Distribution Alignment – DDT defines a distribution discriminant score that captures how likely a token sequence from the training set would be generated by the current model. A high discriminant means the data is “in‑distribution” for the model.
  2. In‑Distribution Fine‑tuning (IDFT) – During SFT, each training example receives a weight proportional to its discriminant score. The loss function becomes a weighted cross‑entropy, encouraging the model to learn more from examples it already considers plausible, thereby reducing the mismatch between training and generation distributions.
  3. Hinted Decoding – At inference time, the model’s own top‑k predictions are fed back as soft “hints” into the prompt (e.g., via prefix tokens or attention bias). This nudges the decoder toward trajectories that the model already deems likely, effectively turning the generation process into an on‑policy rollout without any extra RL optimization.
  4. Evaluation Protocol – The authors benchmarked the approach on instruction‑following and preference‑based datasets, comparing against vanilla SFT, DPO, SimPO, and other offline RL baselines. Metrics include win‑rate against reference models, reward model scores, and human preference alignment.
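Steps 1–2 above can be sketched in a few lines of plain Python. The function names and the softmax‑style weighting below are illustrative assumptions, not the paper’s exact formulation:

```python
import math

def discriminant_score(token_logprobs):
    """Mean per-token log-probability of a training sequence under the
    current model -- higher means the sequence is more 'in-distribution'."""
    return sum(token_logprobs) / len(token_logprobs)

def idft_weights(scores, temperature=1.0):
    """Softmax over discriminant scores (an assumed weighting scheme):
    examples the model already finds plausible get larger training weight."""
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_cross_entropy(batch_logprobs, weights):
    """IDFT-style loss: per-example negative log-likelihood, scaled by
    the example's discriminant-derived weight."""
    loss = 0.0
    for logprobs, w in zip(batch_logprobs, weights):
        nll = -sum(logprobs) / len(logprobs)
        loss += w * nll
    return loss

# Toy batch: per-token log-probs for three training sequences.
batch = [
    [-0.2, -0.3, -0.1],   # very in-distribution
    [-1.5, -2.0, -1.8],   # moderately off-distribution
    [-4.0, -3.5, -5.0],   # far off-distribution
]
scores = [discriminant_score(lp) for lp in batch]
weights = idft_weights(scores)
loss = weighted_cross_entropy(batch, weights)
```

Note how the first sequence, already plausible under the model, dominates the loss, which is exactly the mechanism that shrinks the gap between the training and generation distributions.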

Results & Findings

| Method | Reward Model Score ↑ | Win‑rate vs. SFT ↑ | Compute (GPU‑hrs) |
| --- | --- | --- | --- |
| Vanilla SFT | 0.62 | | |
| DPO (offline RL) | 0.78 | +24% | |
| SimPO | 0.80 | +27% | |
| IDFT + Hinted Decoding | 0.79 | +26% | |

  • The combined IDFT + Hinted Decoding pipeline reaches a reward‑model score of ≈0.79, statistically indistinguishable from the best offline RL baselines.
  • Training time and memory footprint remain comparable to standard SFT, confirming the “on‑policy” benefits come essentially for free.
  • Ablation studies show that both components are necessary: IDFT alone closes roughly 15% of the gap, while Hinted Decoding provides the remaining boost.

Practical Implications

  • Fast, cost‑effective alignment: Companies can improve instruction‑following or preference alignment of LLMs without the heavy engineering overhead of RL (reward model training, policy optimization, safety checks).
  • Deploy‑ready pipelines: Since IDFT is just a weighted loss and Hinted Decoding is a lightweight inference tweak, existing SFT infrastructure (e.g., Hugging Face Trainer, DeepSpeed) can adopt the methods with minimal code changes.
  • Safer RL‑free fine‑tuning: In regulated domains (healthcare, finance) where RL’s exploration can be risky, on‑policy SFT offers a safer alternative while still delivering high‑quality outputs.
  • Scalable to larger models: Because the approach does not require additional gradient steps or large replay buffers, it scales naturally to multi‑billion‑parameter models that are otherwise prohibitive for RL.
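To illustrate how lightweight the inference tweak can be, here is a toy prefix‑token version of Hinted Decoding. The hint format, the `top_k_hints` helper, and the prompt template are all hypothetical; the paper’s attention‑bias variant is not modeled here:

```python
def top_k_hints(next_token_probs, k=3):
    """Pick the model's k most likely continuations to use as hints.
    `next_token_probs` maps candidate tokens to model probabilities."""
    ranked = sorted(next_token_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [tok for tok, _ in ranked[:k]]

def hinted_prompt(prompt, next_token_probs, k=3):
    """Prefix the prompt with the model's own top-k predictions, nudging
    generation toward trajectories the model already deems likely."""
    hints = top_k_hints(next_token_probs, k)
    return f"[hints: {', '.join(hints)}] {prompt}"

# Toy next-token distribution the model assigns for some prompt;
# obtaining it costs the one extra forward pass noted in the limitations.
probs = {"Paris": 0.62, "Lyon": 0.2, "London": 0.1, "Rome": 0.08}
print(hinted_prompt("The capital of France is", probs, k=2))
```

Because the hint is just extra prompt text, this slots into any existing serving stack without touching the model weights, which is what makes the pipeline deploy‑ready.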

Limitations & Future Work

  • Dependence on well‑calibrated probabilities: DDT’s discriminant scores assume the model’s own probability estimates are reliable; poorly calibrated models may mis‑weight the training data.
  • Limited to token‑level alignment: The theory currently addresses distribution mismatch at the token level; higher‑level semantic or factual consistency is not explicitly modeled.
  • Hinted Decoding overhead: While modest, the extra forward pass for hint generation adds latency, which may be noticeable in real‑time applications.
  • Future directions: Extending DDT to multi‑modal data, integrating uncertainty estimation for more robust weighting, and exploring adaptive hint generation strategies that balance speed and alignment quality.

The authors have open‑sourced their implementation, so you can try the on‑policy SFT tricks on your own models today.

Authors

  • Miaosen Zhang
  • Yishan Liu
  • Shuxia Lin
  • Xu Yang
  • Qi Dai
  • Chong Luo
  • Weihao Jiang
  • Peng Hou
  • Anxiang Zeng
  • Xin Geng
  • Baining Guo

Paper Information

  • arXiv ID: 2602.12222v1
  • Categories: cs.LG, cs.AI, cs.CV
  • Published: February 12, 2026
  • PDF: Download PDF