[Paper] Semantic Soft Bootstrapping: Long Context Reasoning in LLMs without Reinforcement Learning
Source: arXiv - 2512.05105v1
Overview
This paper introduces Semantic Soft Bootstrapping (SSB), a self-distillation recipe that lets a single language model teach itself to reason over long contexts without the heavy compute cost of reinforcement learning with verifiable rewards (RLVR). By automatically generating teacher-student pairs from raw problem-answer data, SSB gains more than 10 percentage points on challenging math benchmarks while remaining fully compatible with standard fine-tuning pipelines.
Key Contributions
- Self‑distillation without external rewards: The same base LLM acts as both teacher and student, receiving “soft” semantic signals about correctness instead of sparse RL rewards.
- Automatic data curation: From a set of roll-outs, the pipeline extracts the verified correct answer and the most common wrong answer, then feeds them back as context to elicit a high-quality step-by-step explanation (sketched in code after this list).
- Logit-level supervision: The student is trained to match the teacher's full next-token probability distribution (derived from its logits) at each position, preserving nuanced reasoning information.
- Parameter‑efficient fine‑tuning: Demonstrated on Qwen2.5‑3B‑Instruct, requiring only modest compute compared with full RLVR loops.
- Empirical gains: +10.6 percentage points on GSM8K and +10 points on MATH500/AIME2024 over a strong GRPO RLVR baseline.
- Open‑source release: Code, model checkpoint, and curated dataset are publicly available.
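As a concrete illustration of the data-curation step above, the sketch below selects the verified correct answer and the most common wrong answer from a batch of roll-outs. `sample_rollouts` and `extract_final_answer` are hypothetical helper names, not functions from the paper's released code.

```python
from collections import Counter

def curate_pair(model, problem: str, ground_truth: str, n_rollouts: int = 16):
    """Sketch of SSB's filtering step: pick the verified correct answer and
    the most common wrong answer from a batch of roll-outs."""
    # Hypothetical helpers: sample_rollouts draws n chain-of-thought
    # completions; extract_final_answer parses the final answer from each.
    rollouts = sample_rollouts(model, problem, n=n_rollouts)
    answers = [extract_final_answer(r) for r in rollouts]

    # Verify against the ground truth; keep the correct answer if present.
    correct = next((a for a in answers if a == ground_truth), None)
    # Tally the incorrect answers and keep the most frequent one.
    wrong_counts = Counter(a for a in answers if a != ground_truth)
    most_common_wrong = wrong_counts.most_common(1)[0][0] if wrong_counts else None

    # The pipeline needs at least one correct roll-out (see Limitations).
    if correct is None or most_common_wrong is None:
        return None
    return correct, most_common_wrong
```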
Methodology
- Prompt & Roll‑out Generation – The base model receives a math problem and generates multiple answer candidates (roll‑outs).
- Filtering – Among the roll‑outs, the algorithm selects the correct answer (verified against the ground truth) and the most frequent incorrect answer.
- Contextual Re‑prompting – Both the correct and the common wrong answer are injected back into the prompt, asking the model to produce a detailed, step‑by‑step solution that leads to a verified final answer. This yields a teacher output consisting of a token sequence and its associated logits.
- Student Training – The original problem (without the extra context) is fed to the student model. The training objective minimizes the KL divergence between the student's and teacher's token-level output distributions (computed from their logits), so the student learns to reproduce the teacher's reasoning distribution from the bare question alone (see the sketch after this list).
- Fine‑tuning – The process is applied in a parameter‑efficient manner (e.g., LoRA adapters) on Qwen2.5‑3B‑Instruct, producing a model that can perform long‑context chain‑of‑thought reasoning without any RL loop.
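To make the objective concrete, here is a minimal PyTorch sketch of the re-prompting template and the token-level distillation loss. The template wording, the temperature `tau`, and the masking convention are illustrative assumptions rather than the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def teacher_prompt(problem: str, correct: str, common_wrong: str) -> str:
    # Illustrative template (the exact wording is an assumption): the
    # verified correct answer and the most common wrong answer are injected
    # back into the context so the model explains, step by step, why the
    # correct answer holds.
    return (
        f"Problem: {problem}\n"
        f"A verified correct final answer is {correct}; a frequent "
        f"incorrect answer is {common_wrong}.\n"
        "Write a detailed step-by-step solution ending in the correct answer."
    )

def ssb_loss(student_logits: torch.Tensor,
             teacher_logits: torch.Tensor,
             solution_mask: torch.Tensor,
             tau: float = 1.0) -> torch.Tensor:
    """Token-level KL between teacher and student output distributions.

    Both logit tensors have shape (seq_len, vocab) and are aligned on the
    solution tokens: the teacher produced them under the augmented prompt,
    the student under the bare problem. solution_mask (seq_len,) selects
    the solution tokens so prompt positions do not contribute to the loss.
    """
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="none",
    ).sum(-1)                      # per-token KL, shape (seq_len,)
    return (kl * solution_mask).sum() / solution_mask.sum() * tau ** 2
```

Wrapping the student in LoRA adapters (e.g., via the peft library) keeps the update parameter-efficient, matching the Qwen2.5-3B-Instruct setup described above.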
Results & Findings
| Benchmark | Baseline (GRPO) | SSB (this work) | Δ Accuracy (pts) |
|---|---|---|---|
| GSM8K (test) | ~68 % | 78.6 % | +10.6 |
| MATH500 / AIME2024 | ~45 % | 55 % | +10 |
- The gains are achieved without any human‑written chain‑of‑thought annotations; the teacher data is fully auto‑generated.
- Training time and GPU memory consumption are roughly 30 % lower than a comparable RLVR run, since SSB replaces the policy-gradient loop and repeated on-policy roll-out scoring with a single supervised distillation pass.
- Qualitative inspection shows the SSB‑trained model produces more coherent intermediate steps and fewer “hallucinated” calculations.
Practical Implications
- Lower cost for reasoning‑heavy LLMs: Companies can improve math or code‑generation capabilities using existing base models and modest fine‑tuning budgets, sidestepping expensive RL pipelines.
- Plug-and-play for existing APIs: Since SSB works as a standard supervised fine-tuning step, it can be integrated into CI/CD workflows for model updates without redesigning the training stack (see the sketch after this list).
- Better user‑facing explanations: The step‑by‑step outputs are more reliable, which is valuable for developer tools that need to justify suggestions (e.g., code assistants, tutoring bots).
- Dataset bootstrapping: The automatic teacher‑student pair generation can be repurposed for other domains (e.g., logic puzzles, data‑analysis queries) where ground‑truth answers exist but detailed reasoning is scarce.
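Because SSB reduces to standard supervised fine-tuning, it can slot into an off-the-shelf Hugging Face training loop. A minimal sketch, assuming (hypothetically) that each curated batch carries pre-computed teacher logits under a `teacher_logits` key:

```python
import torch.nn.functional as F
from transformers import Trainer

class SSBTrainer(Trainer):
    """Drop-in Trainer whose loss is the SSB logit-matching objective."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # 'teacher_logits' is an assumed dataset field holding the cached
        # teacher distribution for each target token.
        teacher_logits = inputs.pop("teacher_logits")
        outputs = model(**inputs)
        loss = F.kl_div(
            F.log_softmax(outputs.logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="batchmean",
        )
        return (loss, outputs) if return_outputs else loss
```

Everything else (checkpointing, logging, LoRA wrapping via peft) stays as in a regular SFT run, which is what makes the recipe CI/CD-friendly.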
Limitations & Future Work
- Domain specificity: Experiments focus on arithmetic and competition‑style math; transfer to natural‑language reasoning or programming tasks remains to be validated.
- Reliance on correct roll‑outs: The pipeline assumes at least one correct answer appears among the initial roll‑outs; for extremely hard problems this may fail.
- Model size scaling: Results are shown on a 3 B‑parameter model; it is unclear how the approach scales to 30 B+ models where logits become noisier.
- Future directions include extending SSB to multi‑modal contexts, incorporating uncertainty estimation for the “most common wrong answer,” and exploring hybrid setups that combine soft bootstrapping with lightweight reward signals for even richer supervision.
Authors
- Purbesh Mitra
- Sennur Ulukus
Paper Information
- arXiv ID: 2512.05105v1
- Categories: cs.CL, cs.AI, cs.IT, cs.LG, eess.SP
- Published: December 4, 2025
- PDF: https://arxiv.org/pdf/2512.05105v1