How to build custom reasoning agents with a fraction of the compute
Source: VentureBeat
Training AI Reasoning Models: Challenges and a New Paradigm
Training AI reasoning models demands resources that most enterprise teams do not have. Engineering teams are often forced to choose between distilling knowledge from large, expensive models or relying on reinforcement‑learning techniques that provide sparse feedback.
Researchers at JD.com and several academic institutions recently introduced a new training paradigm that sidesteps this dilemma. The technique, called Reinforcement Learning with Verifiable Rewards with Self‑Distillation (RLSD), combines the reliable, outcome‑verified reward signal of reinforcement learning with the granular, token‑level feedback of self‑distillation.
Experiments indicate that models trained with RLSD outperform those built on classic distillation and reinforcement‑learning algorithms. For enterprise teams, this approach lowers the technical and financial barriers to building custom reasoning models tailored to specific business logic.
The Problem with Training Reasoning Models
The standard method for training reasoning models is Reinforcement Learning with Verifiable Rewards (RLVR). In this paradigm, the model learns through trial and error, guided by a final outcome from its environment. An automated verifier checks if the model’s answer is right or wrong, providing a binary reward (e.g., 0 or 1).
“Standard GRPO has a signal density problem,” Chenxu Yang, co‑author of the paper, told VentureBeat. “A multi‑thousand‑token reasoning trace gets a single binary reward, and every token inside that trace receives identical credit, whether it’s a pivotal logical step or a throwaway phrase.”
Consequently, the model never learns which intermediate steps led to its success or failure.
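To make the signal‑density problem concrete, here is a minimal sketch of an RLVR‑style verifier for a math task. The boxed‑answer format, extraction regex, and reward shape are illustrative assumptions rather than the paper’s exact setup:

```python
import re

def extract_final_answer(trace: str) -> str | None:
    """Pull the final boxed answer out of a long reasoning trace (assumed format)."""
    match = re.search(r"\\boxed\{([^}]*)\}", trace)
    return match.group(1).strip() if match else None

def verifiable_reward(trace: str, gold_answer: str) -> float:
    """Binary RLVR-style reward: 1.0 if the final answer matches, else 0.0.

    Every token in a multi-thousand-token trace receives this single scalar,
    which is exactly the signal-density problem Yang describes.
    """
    predicted = extract_final_answer(trace)
    return 1.0 if predicted is not None and predicted == gold_answer.strip() else 0.0
```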
On‑Policy Distillation (OPD)
Instead of waiting for a final outcome, developers pair a smaller student model with a larger, more capable teacher model. For each training example, the student compares its response to that of the teacher token by token, receiving granular feedback on the entire reasoning chain and response‑generation process.
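As a rough illustration of that token‑level signal (not the paper’s exact loss), the distillation objective can be written as a per‑token KL divergence between the teacher’s and student’s next‑token distributions. The sketch below assumes PyTorch and a shared tokenizer:

```python
import torch
import torch.nn.functional as F

def per_token_distillation_loss(student_logits: torch.Tensor,
                                teacher_logits: torch.Tensor) -> torch.Tensor:
    """Token-level KL between teacher and student next-token distributions.

    Shapes: [batch, seq_len, vocab]. Unlike a single end-of-trace reward,
    every position in the reasoning chain gets its own feedback signal.
    Assumes teacher and student share the same vocabulary, which is one of
    the constraints noted below.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # KL(teacher || student), computed independently at every token position.
    kl_per_token = F.kl_div(student_logp, teacher_logp,
                            log_target=True, reduction="none").sum(-1)
    return kl_per_token.mean()
```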
**Drawbacks**
- Deploying and running a separate, much larger teacher model alongside the student throughout training incurs substantial computational overhead.
“You have to keep a larger teacher model resident throughout training, which roughly doubles your GPU footprint,” Yang said.
- The teacher and student must share the exact same vocabulary structure, which “quietly rules out most cross‑architecture, cross‑modality, or multilingual setups that enterprises actually run.”
The Promise and Failure of Self‑Distillation
On‑Policy Self‑Distillation (OPSD) emerged as a solution designed to overcome the shortcomings of the other two approaches. In OPSD, the same model plays the role of both the student and the teacher.
During training:
- The student receives a standard prompt.
- The teacher receives privileged information (e.g., a verified, step‑by‑step answer key).
- The teacher evaluates the student, providing token‑by‑token feedback as the student attempts the problem using only the standard prompt.
OPSD appears to be the perfect compromise for an enterprise budget. It delivers the granular, step‑by‑step guidance of OPD while retaining the high computational efficiency and low cost of RLVR—only requiring an extra forward pass for the teacher.
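A minimal sketch of that asymmetry, with illustrative prompt templates (the exact form of the privileged information in the paper may differ):

```python
def build_student_prompt(question: str) -> str:
    # The student only ever sees the ordinary task prompt.
    return f"Question: {question}\nThink step by step, then give the final answer."

def build_teacher_prompt(question: str, answer_key: str) -> str:
    # The self-teacher is the same model, but its context carries privileged
    # information (here, a verified answer key) that the student never sees.
    return (
        f"Question: {question}\n"
        f"Reference solution (hidden from the student): {answer_key}\n"
        "Think step by step, then give the final answer."
    )
```

Because the distillation target is the teacher’s output distribution under this richer context, the student is asked to reproduce something its own prompt cannot support, which is where the trouble starts.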
However, researchers found that OPSD suffers from a phenomenon called “privileged information leakage.”
“The objective is structurally ill‑posed,” Yang said. “There’s an irreducible mutual‑information gap that the student can never close… When self‑distillation is set up as distribution matching, the student is asked to imitate the teacher’s full output distribution under privileged context.”
Because the teacher evaluates the student based on a hidden answer key, the training objective forces the student model to learn the teacher’s exact phrasing or steps instead of the underlying reasoning logic. As a result, the student model starts hallucinating references to an invisible solution that it will not have access to in a real‑world deployment.
In practice, OPSD models show a rapid spike in performance early in training, but their reasoning capabilities soon plateau and progressively degrade over time.
Decoupling Direction from Magnitude with RLSD
The researchers behind RLSD realized that the signals governing how a model updates its parameters have fundamentally asymmetric requirements:
| Signal | Requirement |
|---|---|
| Direction of update (reinforce vs. penalize) | Can be sparse but must be perfectly reliable; pointing the model in the wrong direction damages its reasoning policy. |
| Magnitude of update (how much credit or blame a specific step deserves) | Benefits from being extremely dense to enable fine‑grained, step‑by‑step corrections. |
How RLSD Works
- Verifiable environmental feedback (the RLVR signal) strictly determines the direction of learning. The model receives overall reinforcement only if the final answer is objectively correct.
- The self‑teacher is stripped of its power to dictate what the model should generate. Instead, its token‑by‑token assessment is repurposed to determine the magnitude of the update, distributing the total credit or blame across the individual steps of the model’s reasoning path.
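The sketch below shows one way that decoupling could be wired up. The specific credit formula (a softmax over teacher‑versus‑student log‑probability gaps) is an assumption for illustration, not the paper’s exact update rule:

```python
import torch

def rlsd_token_advantages(outcome_reward: float,
                          teacher_token_logps: torch.Tensor,
                          student_token_logps: torch.Tensor) -> torch.Tensor:
    """Decouple direction from magnitude for a single sampled response.

    Direction: the sign comes only from the verifiable outcome reward
    (correct final answer -> reinforce, wrong -> penalize).
    Magnitude: the self-teacher's token-by-token assessment of the student's
    own tokens becomes non-negative weights that spread the credit or blame
    across the reasoning trace. This weighting is an illustrative assumption.

    Both log-prob tensors have shape [seq_len] and score the tokens the
    student actually generated.
    """
    # Tokens the privileged-context teacher finds more likely than the
    # student did are treated as the ones "doing the work".
    credit = torch.softmax(teacher_token_logps - student_token_logps, dim=-1)
    direction = 1.0 if outcome_reward > 0 else -1.0
    # Rescale so the total magnitude over the trace stays comparable to a
    # uniform per-token assignment.
    return direction * credit * credit.numel()
```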
This decoupling alters how the model learns compared to classic OPSD. In standard OPSD, the training objective acts like behavioral cloning, forcing the model to directly copy the exact wording and phrasing of the teacher. This causes the student to hallucinate and leak references to data it will never actually see at inference time.
RLSD: A Cost‑Effective Way to Provide Per‑Token Credit Information
“The intuition: we’re not teaching the model to reason like the teacher,” Yang said.
“We’re telling the model, on the path it chose, which of its own tokens were actually doing the work. The model’s exploration distribution stays its own. Only the credit allocation gets sharpened.”
Why RLSD Matters
- No hidden‑solution copying – The model is never pushed to imitate a teacher’s privileged output; the self‑teacher serves only as a natural, virtually cost‑free source of per‑token credit.
- Fine‑grained reward – Tokens that strongly support the correct outcome receive a higher score; useless filler words receive only a baseline score.
- Simplified pipeline – RLSD eliminates the need for:
- Complex auxiliary reward networks
- Manually annotated step‑by‑step data
- Massive external teacher models
Putting RLSD to the Test
The researchers fine‑tuned the open‑weight Qwen3‑VL‑8B vision‑language model and evaluated it on several visual‑reasoning benchmarks:
| Benchmark | Domain |
|---|---|
| MMMU | College‑level multi‑discipline questions |
| MathVista | Mathematical reasoning over images |
| MathVision | Visual math problems |
| WeMath | Structured math tasks |
| ZeroBench | Stress‑test benchmark designed to be nearly impossible for current frontier models |
Comparison Methods
| Method | Description |
|---|---|
| Base model | No post‑training |
| GRPO (standard RLVR) | Group Relative Policy Optimization: classic reinforcement learning with a sparse, verifiable outcome reward |
| OPSD | On‑policy self‑distillation, with the same model acting as student and privileged teacher |
| Hybrid (GRPO + OPSD) | Combination of the two above |
| RLSD | Proposed per‑token credit allocation method |
Results
- Average accuracy across all five benchmarks: 56.18 % (highest of all methods)
- Improvement over base model: +4.69 %
- Improvement over standard RLVR (GRPO): +2.32 %
- Largest gain: MathVision benchmark, +3.91 % over standard RLVR
Efficiency Gains
- Convergence speed: RLSD matches GRPO’s final performance after roughly 200 training steps, versus about 400 steps for GRPO itself (~2× faster convergence).
- Cost overhead: Only one extra forward pass per response to capture teacher logits; negligible compared to rollout generation.
Stability
- Unlike OPSD, which spikes then collapses due to information leakage, RLSD maintains long‑term training stability and converges to a higher performance ceiling.
Qualitative Insights
Example 1: Visual Counting Task
- Standard RLVR: Rewards the entire paragraph of reasoning equally once the final answer is correct.
- RLSD: Surgically rewards the exact subtraction steps that solved the problem and down‑weights generic filler text such as “Looking at the image, I see…”.
Example 2: Incorrect Math Derivation from a Bar Chart
- Standard RLVR: Labels the whole response as a failure.
- RLSD: Places the heaviest penalty on the precise point where the model misread the chart relationship, while remaining neutral on the rest of the logical setup.
This granularity is crucial for real‑world enterprise use cases. If a model misinterprets a single assumption in a 50‑page quarterly earnings report, developers want to correct that assumption—not discard the entire analytical framework. RLSD enables token‑by‑token learning while keeping training costs reasonable.
How Enterprises Can Get Started
Prerequisites
- Verifiable reward signal – e.g., code compilers, math checkers, SQL execution, schema validators (a text‑to‑SQL sketch follows below).
- Optional privileged information – Full reasoning traces (if available) or just the ground‑truth final answer.
“Tasks without verifiable reward (open‑ended dialogue, brand‑voice writing) belong in preference‑based pipelines,” Yang notes.
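As an illustration of a verifiable reward outside math, here is a hypothetical text‑to‑SQL verifier that executes the model’s query against a read‑only copy of the database; the matching criterion and error handling are assumptions for the sketch:

```python
import sqlite3

def sql_execution_reward(generated_sql: str, gold_sql: str, db_path: str) -> float:
    """Reward 1.0 only if the generated query returns the same rows as a
    verified gold query; anything else (including SQL errors) earns 0.0."""
    conn = sqlite3.connect(db_path)
    try:
        predicted = conn.execute(generated_sql).fetchall()
        expected = conn.execute(gold_sql).fetchall()
        # Compare row sets order-insensitively; key=repr sidesteps mixed-type sorting.
        return 1.0 if sorted(predicted, key=repr) == sorted(expected, key=repr) else 0.0
    except sqlite3.Error:
        return 0.0
    finally:
        conn.close()
```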
Flexibility Compared to OPSD
- OPSD: Requires full intermediate reasoning traces, forcing enterprises to pay annotators or distill from a frontier model.
- RLSD: Works with either full verified traces or only the final answer, offering far greater flexibility (illustrated in the sketch below).
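A small sketch of that flexibility: the privileged context handed to the self‑teacher can be assembled from whichever ground truth a team actually has, with the prompt format here being an illustrative assumption:

```python
def build_privileged_context(question: str,
                             final_answer: str,
                             verified_trace: str | None = None) -> str:
    """Build the self-teacher's privileged context from available supervision:
    a full verified reasoning trace if one exists, otherwise just the
    ground-truth final answer."""
    if verified_trace:
        reference = f"Verified solution:\n{verified_trace}"
    else:
        reference = f"Verified final answer: {final_answer}"
    return f"{question}\n\n[Privileged reference, hidden from the student]\n{reference}"
```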
Integration Steps
- Choose an RL framework – e.g., veRL or EasyR1 (both open‑source, multi‑modality).
- Swap the objective – Modify a few dozen lines to adjust the GRPO objective and synchronize the teacher with the student (a framework‑agnostic sketch follows after this list).
- Run training – No major framework rewrite is needed; RLSD slots directly into the standard stack.
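For step 2, here is a framework‑agnostic sketch of what the swapped objective might look like; the variable names and plumbing are assumptions and do not correspond to veRL’s or EasyR1’s actual APIs:

```python
import torch

def rlsd_grpo_loss(token_logps: torch.Tensor,      # [group, seq] log-probs of sampled tokens
                   old_token_logps: torch.Tensor,  # [group, seq] from the rollout policy
                   rewards: torch.Tensor,          # [group] verifiable outcome rewards
                   token_credit: torch.Tensor,     # [group, seq] self-teacher credit weights
                   clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped GRPO-style objective where the group-normalized outcome reward
    sets the direction of each update and the self-teacher's per-token credit
    reshapes its magnitude (an illustrative formulation, not the paper's)."""
    # Group-relative advantage from the verifiable reward (direction).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    # Spread each sequence-level advantage across tokens via credit weights (magnitude).
    per_token_adv = adv.unsqueeze(-1) * token_credit
    ratio = torch.exp(token_logps - old_token_logps)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * per_token_adv, clipped * per_token_adv).mean()
```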
Looking Ahead
“The proprietary data enterprises hold inside their perimeter (compliance manuals, internal documentation, historical tickets, verified code snippets) is essentially free privileged information,” Yang concludes.
“RLSD lets enterprises feed this kind of data straight in as privileged context, sharpening the learning signal on smaller models without needing an external teacher and without sending anything outside the network.”