[Paper] Semantic Soft Bootstrapping: Long Context Reasoning in LLMs without Reinforcement Learning
Source: arXiv - 2512.05105v1
Overview
This paper introduces Semantic Soft Bootstrapping (SSB), a self-distillation recipe that lets a single language model teach itself to reason over long contexts without the heavy compute cost of reinforcement learning with verifiable rewards (RLVR). By automatically generating teacher-student pairs from raw problem-answer data, SSB gains more than 10 percentage points on challenging math benchmarks while remaining fully compatible with standard fine-tuning pipelines.
Key Contributions
- Self‑distillation without external rewards: The same base LLM acts as both teacher and student, receiving “soft” semantic signals about correctness instead of sparse RL rewards.
- Automatic data curation: From a set of roll-outs, the pipeline extracts the verified correct answer and the most common wrong answer, then feeds them back as context to elicit a high-quality step-by-step explanation (sketched in code after this list).
- Logit-level supervision: The student is trained to match the teacher's full next-token probability distribution (derived from its logits) at each position, preserving nuanced reasoning information.
- Parameter‑efficient fine‑tuning: Demonstrated on Qwen2.5‑3B‑Instruct, requiring only modest compute compared with full RLVR loops.
- Empirical gains: +10.6 percentage points on GSM8K and +10 points on MATH500/AIME2024 over a strong GRPO RLVR baseline.
- Open‑source release: Code, model checkpoint, and curated dataset are publicly available.
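As a concrete illustration of the data-curation step above, the sketch below selects the verified correct answer and the most common wrong answer from a batch of roll-outs. `sample_rollouts` and `extract_final_answer` are hypothetical helper names, not functions from the paper's released code.

```python
from collections import Counter

def curate_pair(model, problem: str, ground_truth: str, n_rollouts: int = 16):
    """Sketch of SSB's filtering step: pick the verified correct answer and
    the most common wrong answer from a batch of roll-outs."""
    # Hypothetical helpers: sample_rollouts draws n chain-of-thought
    # completions; extract_final_answer parses the final answer from each.
    rollouts = sample_rollouts(model, problem, n=n_rollouts)
    answers = [extract_final_answer(r) for r in rollouts]

    # Verify against the ground truth; keep the correct answer if present.
    correct = next((a for a in answers if a == ground_truth), None)
    # Tally the incorrect answers and keep the most frequent one.
    wrong_counts = Counter(a for a in answers if a != ground_truth)
    most_common_wrong = wrong_counts.most_common(1)[0][0] if wrong_counts else None

    # The pipeline needs at least one correct roll-out (see Limitations).
    if correct is None or most_common_wrong is None:
        return None
    return correct, most_common_wrong
```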
Methodology
- Prompt & Roll‑out Generation – The base model receives a math problem and generates multiple answer candidates (roll‑outs).
- Filtering – Among the roll‑outs, the algorithm selects the correct answer (verified against the ground truth) and the most frequent incorrect answer.
- Contextual Re‑prompting – Both the correct and the common wrong answer are injected back into the prompt, asking the model to produce a detailed, step‑by‑step solution that leads to a verified final answer. This yields a teacher output consisting of a token sequence and its associated logits.
- Student Training – The original problem (without the extra context) is fed to the student model. The training objective minimizes the KL divergence between the student's and teacher's token-level output distributions (computed from their logits), so the student learns to reproduce the teacher's reasoning distribution from the bare question alone (see the sketch after this list).
- Fine‑tuning – The process is applied in a parameter‑efficient manner (e.g., LoRA adapters) on Qwen2.5‑3B‑Instruct, producing a model that can perform long‑context chain‑of‑thought reasoning without any RL loop.
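To make the objective concrete, here is a minimal PyTorch sketch of the re-prompting template and the token-level distillation loss. The template wording, the temperature `tau`, and the masking convention are illustrative assumptions rather than the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def teacher_prompt(problem: str, correct: str, common_wrong: str) -> str:
    # Illustrative template (the exact wording is an assumption): the
    # verified correct answer and the most common wrong answer are injected
    # back into the context so the model explains, step by step, why the
    # correct answer holds.
    return (
        f"Problem: {problem}\n"
        f"A verified correct final answer is {correct}; a frequent "
        f"incorrect answer is {common_wrong}.\n"
        "Write a detailed step-by-step solution ending in the correct answer."
    )

def ssb_loss(student_logits: torch.Tensor,
             teacher_logits: torch.Tensor,
             solution_mask: torch.Tensor,
             tau: float = 1.0) -> torch.Tensor:
    """Token-level KL between teacher and student output distributions.

    Both logit tensors have shape (seq_len, vocab) and are aligned on the
    solution tokens: the teacher produced them under the augmented prompt,
    the student under the bare problem. solution_mask (seq_len,) selects
    the solution tokens so prompt positions do not contribute to the loss.
    """
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="none",
    ).sum(-1)                      # per-token KL, shape (seq_len,)
    return (kl * solution_mask).sum() / solution_mask.sum() * tau ** 2
```

Wrapping the student in LoRA adapters (e.g., via the peft library) keeps the update parameter-efficient, matching the Qwen2.5-3B-Instruct setup described above.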
Results & Findings
| Benchmark | Baseline (GRPO) | SSB (this work) | Δ Accuracy (pts) |
|---|---|---|---|
| GSM8K (test) | ~68 % | 78.6 % | +10.6 |
| MATH500 / AIME2024 | ~45 % | 55 % | +10 |
- The gains are achieved without any human‑written chain‑of‑thought annotations; the teacher data is fully auto‑generated.
- Training time and GPU memory consumption are roughly 30 % lower than a comparable RLVR run, since SSB replaces the policy-gradient loop and repeated on-policy roll-out scoring with a single supervised distillation pass.
- Qualitative inspection shows the SSB‑trained model produces more coherent intermediate steps and fewer “hallucinated” calculations.
Practical Implications
- Lower cost for reasoning‑heavy LLMs: Companies can improve math or code‑generation capabilities using existing base models and modest fine‑tuning budgets, sidestepping expensive RL pipelines.
- Plug-and-play for existing APIs: Since SSB works as a standard supervised fine-tuning step, it can be integrated into CI/CD workflows for model updates without redesigning the training stack (see the sketch after this list).
- Better user‑facing explanations: The step‑by‑step outputs are more reliable, which is valuable for developer tools that need to justify suggestions (e.g., code assistants, tutoring bots).
- Dataset bootstrapping: The automatic teacher‑student pair generation can be repurposed for other domains (e.g., logic puzzles, data‑analysis queries) where ground‑truth answers exist but detailed reasoning is scarce.
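Because SSB reduces to standard supervised fine-tuning, it can slot into an off-the-shelf Hugging Face training loop. A minimal sketch, assuming (hypothetically) that each curated batch carries pre-computed teacher logits under a `teacher_logits` key:

```python
import torch.nn.functional as F
from transformers import Trainer

class SSBTrainer(Trainer):
    """Drop-in Trainer whose loss is the SSB logit-matching objective."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # 'teacher_logits' is an assumed dataset field holding the cached
        # teacher distribution for each target token.
        teacher_logits = inputs.pop("teacher_logits")
        outputs = model(**inputs)
        loss = F.kl_div(
            F.log_softmax(outputs.logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="batchmean",
        )
        return (loss, outputs) if return_outputs else loss
```

Everything else (checkpointing, logging, LoRA wrapping via peft) stays as in a regular SFT run, which is what makes the recipe CI/CD-friendly.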
Limitations & Future Work
- Domain specificity: Experiments focus on arithmetic and competition‑style math; transfer to natural‑language reasoning or programming tasks remains to be validated.
- Reliance on correct roll‑outs: The pipeline assumes at least one correct answer appears among the initial roll‑outs; for extremely hard problems this may fail.
- Model size scaling: Results are shown on a 3 B‑parameter model; it is unclear how the approach scales to 30 B+ models where logits become noisier.
- Future directions include extending SSB to multi‑modal contexts, incorporating uncertainty estimation for the “most common wrong answer,” and exploring hybrid setups that combine soft bootstrapping with lightweight reward signals for even richer supervision.
Authors
- Purbesh Mitra
- Sennur Ulukus
Paper Information
- arXiv ID: 2512.05105v1
- Categories: cs.CL, cs.AI, cs.IT, cs.LG, eess.SP
- Published: December 4, 2025
- PDF: https://arxiv.org/pdf/2512.05105v1