[Paper] From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning
Source: arXiv - 2512.01970v1
Overview
The paper investigates how reinforcement learning (RL) can turn simple reasoning abilities into more powerful, compositional ones. By breaking a complex “complementary reasoning” task into two atomic skills—Parametric Reasoning (using internal knowledge) and Contextual Reasoning (leveraging external information)—the authors show that RL can synthesize these primitives into a robust, generalizable strategy, but only after the model has first mastered each skill through supervised fine‑tuning (SFT).
Key Contributions
- Atomic‑to‑Composite Framework: Formalizes complementary reasoning as the composition of two decoupled atomic tasks, enabling clean experimental control.
- SFT Generalization Paradox: Demonstrates that models trained only on the composite task achieve near‑perfect in‑distribution scores yet collapse on out‑of‑distribution (OOD) compositional tests.
- RL as a Reasoning Synthesizer: Shows that RL does not merely amplify existing probabilities; it can learn to combine atomic skills into novel reasoning pathways.
- Atomic Prerequisite Insight: Identifies a strict requirement: RL can only succeed if the base model has already mastered the individual atomic skills via SFT.
- Scalable Training Pipeline: Proposes a two‑stage recipe—first SFT on atomic tasks, then RL on the composite task—that yields strong OOD generalization without explicit supervision on every possible composition.
Methodology
- Synthetic Biography Dataset: The authors generate a controlled set of human biographies where each entry contains both parametric facts (e.g., birth year) and contextual clues (e.g., a referenced event).
- Task Decomposition (illustrative examples follow this list):
  - Parametric Reasoning – answer questions that can be solved using only the model's internal knowledge.
  - Contextual Reasoning – answer questions that require extracting and using information from the provided biography.
  - Composite (Complementary) Reasoning – answer questions that need both pieces of information together.
- Training Regimes (a training‑loop sketch also follows this list):
  - SFT‑Only: Fine‑tune a language model on the composite task alone.
  - Atomic‑SFT + RL: First fine‑tune separately on the two atomic tasks, then apply RL (policy gradient) on the composite task, rewarding correct final answers.
- Generalization Benchmarks: Three difficulty tiers are evaluated:
  - I.I.D. – test data drawn from the same distribution as training.
  - Composition – novel combinations of known atomic patterns.
  - Zero‑Shot – entirely new relational structures never seen during training.
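To make the decomposition concrete, the sketch below shows what the three question types could look like. The schema, names, and facts here are invented for illustration and are not taken from the paper's dataset.

```python
# Hypothetical examples of the three task types. The biography, field names,
# and answers are invented for illustration; the paper's synthetic dataset
# may use a different schema.
examples = {
    # Parametric: answerable purely from facts the model memorized during SFT.
    "parametric": {
        "question": "In which year was Mira Holt born?",
        "context": None,
        "answer": "1962",
    },
    # Contextual: answerable only from the supplied passage.
    "contextual": {
        "question": "Which event did Mira Holt speak at, according to the passage?",
        "context": "The biography notes that Mira Holt spoke at the 1990 "
                   "Clearwater Symposium.",
        "answer": "the 1990 Clearwater Symposium",
    },
    # Composite (complementary): needs the memorized birth year AND the passage.
    "composite": {
        "question": "How old was Mira Holt when she spoke at the event in the passage?",
        "context": "The biography notes that Mira Holt spoke at the 1990 "
                   "Clearwater Symposium.",
        "answer": "28",  # 1990 (from context) - 1962 (from internal knowledge)
    },
}
```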
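The two‑stage recipe itself can be wired up roughly as below. This is a minimal REINFORCE‑style sketch assuming a binary exact‑match reward; `sft`, `sample_answer`, `optimizer`, and the dataset variables are placeholder names, not the authors' code.

```python
import torch

def reinforce_loss(logprobs: torch.Tensor, reward: float) -> torch.Tensor:
    """Policy-gradient (REINFORCE) loss for one sampled answer.

    logprobs: per-token log-probabilities of the sampled answer under the policy.
    reward:   1.0 if the answer matches the gold answer, else 0.0
              (the binary reward described in the paper).
    """
    # Maximizing expected reward == minimizing -(reward * total log-probability).
    return -reward * logprobs.sum()

# Schematic of the two-stage pipeline (placeholder helpers, illustration only):
#
# 1) Supervised fine-tuning on the atomic tasks
#    model = sft(model, parametric_data)    # internal-knowledge questions
#    model = sft(model, contextual_data)    # passage-grounded questions
#
# 2) Policy-gradient RL on the composite task
#    for question, context, gold in composite_data:
#        answer, logprobs = sample_answer(model, question, context)
#        reward = float(answer.strip() == gold.strip())
#        reinforce_loss(logprobs, reward).backward()
#        optimizer.step(); optimizer.zero_grad()
```

In practice the RL stage would typically add the usual stabilizers (a baseline or group‑normalized advantages, and KL regularization toward the SFT checkpoint); they are omitted here for brevity.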
Results & Findings
| Training Setup | I.I.D. Accuracy | Composition Accuracy | Zero‑Shot Accuracy |
|---|---|---|---|
| SFT‑Only (Composite) | ~99% | ~45% | ~12% |
| Atomic‑SFT + RL | ~97% | ~84% | ~71% |
- SFT‑Only models excel when the test follows the training distribution but collapse when required to recombine skills in unseen ways.
- RL‑augmented models retain high in‑distribution performance while dramatically improving OOD generalization, especially in the hardest Zero‑Shot setting.
- Ablation experiments confirm that removing either atomic pre‑training step destroys the RL benefit, underscoring the atomic prerequisite.
Practical Implications
- Modular Skill Development: Developers can train language models on narrowly defined primitives (e.g., factual lookup, context extraction) before asking them to solve more intricate tasks, reducing the need for massive labeled composite datasets.
- Robust AI Assistants: For applications like personal assistants, customer support bots, or code‑generation tools that must blend internal knowledge with user‑provided context, the two‑stage pipeline promises better handling of novel request patterns.
- Cost‑Effective RL: Since RL is only applied after the model already knows the basics, the policy‑gradient phase converges faster and requires fewer environment interactions than end‑to‑end RL on the full task.
- Safety & Explainability: By forcing the model to rely on explicit atomic skills, it becomes easier to audit which knowledge source (internal vs. external) contributed to a decision, aiding transparency and debugging.
Limitations & Future Work
- Synthetic Domain: The experiments use a curated biography dataset; real‑world texts (e.g., news articles, codebases) may introduce noise and ambiguities not captured here.
- Scalability to Large Models: The study focuses on mid‑size language models; it remains open how the findings translate to billion‑parameter models with richer internal knowledge.
- Reward Design: The RL reward is binary (correct/incorrect). More nuanced reward shaping (e.g., partial credit for correct sub‑steps) could further improve learning efficiency; a toy sketch of such a shaped reward follows this list.
- Extension to Multi‑Modal Reasoning: Future work could explore whether the atomic‑to‑composite pipeline works when one of the primitives involves non‑textual modalities (images, tables, code).
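The reward‑shaping idea above can be illustrated with a toy example: score the intermediate atomic step separately from the final answer. The sub‑step check and the 0.3/0.7 weights are assumptions made for illustration, not the paper's design.

```python
def shaped_reward(pred_fact: str, gold_fact: str,
                  pred_answer: str, gold_answer: str) -> float:
    """Toy partial-credit reward: credit a correct intermediate step
    (e.g., extracting the right contextual fact) as well as the final answer.
    The 0.3/0.7 split is an illustrative choice, not taken from the paper."""
    step_ok = float(pred_fact.strip().lower() == gold_fact.strip().lower())
    final_ok = float(pred_answer.strip().lower() == gold_answer.strip().lower())
    return 0.3 * step_ok + 0.7 * final_ok

# Example: correct intermediate fact but wrong final answer earns partial credit.
print(shaped_reward("1990 Clearwater Symposium", "1990 Clearwater Symposium",
                    "27", "28"))  # -> 0.3
```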
Bottom line: By first teaching a model to master simple, well‑defined reasoning skills and then letting RL stitch those skills together, we can build systems that generalize far beyond the data they were explicitly trained on—opening a practical path toward truly compositional AI.
Authors
- Sitao Cheng
- Xunjian Yin
- Ruiwen Zhou
- Yuxuan Li
- Xinyi Wang
- Liangming Pan
- William Yang Wang
- Victor Zhong
Paper Information
- arXiv ID: 2512.01970v1
- Categories: cs.AI, cs.CL
- Published: December 1, 2025