[Paper] From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning
Source: arXiv - 2512.01970v1
Overview
The paper investigates how reinforcement learning (RL) can turn simple reasoning abilities into more powerful, compositional ones. By breaking a complex “complementary reasoning” task into two atomic skills—Parametric Reasoning (using internal knowledge) and Contextual Reasoning (leveraging external information)—the authors show that RL can synthesize these primitives into a robust, generalizable strategy, but only after the model has first mastered each skill through supervised fine‑tuning (SFT).
Key Contributions
- Atomic‑to‑Composite Framework: Formalizes complementary reasoning as the composition of two decoupled atomic tasks, enabling clean experimental control.
- SFT Generalization Paradox: Demonstrates that models trained only on the composite task achieve near‑perfect in‑distribution scores yet collapse on out‑of‑distribution (OOD) compositional tests.
- RL as a Reasoning Synthesizer: Shows that RL does not merely amplify existing probabilities; it can learn to combine atomic skills into novel reasoning pathways.
- Atomic Prerequisite Insight: Identifies a strict requirement: RL can only succeed if the base model has already mastered the individual atomic skills via SFT.
- Scalable Training Pipeline: Proposes a two‑stage recipe—first SFT on atomic tasks, then RL on the composite task—that yields strong OOD generalization without explicit supervision on every possible composition.
Methodology
- Synthetic Biography Dataset: The authors generate a controlled set of human biographies where each entry contains both parametric facts (e.g., birth year) and contextual clues (e.g., a referenced event).
- Task Decomposition (illustrative examples follow this list):
  - Parametric Reasoning – answer questions that can be solved using only the model's internal knowledge.
  - Contextual Reasoning – answer questions that require extracting and using information from the provided biography.
  - Composite (Complementary) Reasoning – answer questions that need both pieces of information together.
- Training Regimes (a training‑loop sketch also follows this list):
  - SFT‑Only: Fine‑tune a language model on the composite task alone.
  - Atomic‑SFT + RL: First fine‑tune separately on the two atomic tasks, then apply RL (policy gradient) on the composite task, rewarding correct final answers.
- Generalization Benchmarks: Three difficulty tiers are evaluated:
  - I.I.D. – test data drawn from the same distribution as training.
  - Composition – novel combinations of known atomic patterns.
  - Zero‑Shot – entirely new relational structures never seen during training.
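To make the decomposition concrete, the sketch below shows what the three question types could look like. The schema, names, and facts here are invented for illustration and are not taken from the paper's dataset.

```python
# Hypothetical examples of the three task types. The biography, field names,
# and answers are invented for illustration; the paper's synthetic dataset
# may use a different schema.
examples = {
    # Parametric: answerable purely from facts the model memorized during SFT.
    "parametric": {
        "question": "In which year was Mira Holt born?",
        "context": None,
        "answer": "1962",
    },
    # Contextual: answerable only from the supplied passage.
    "contextual": {
        "question": "Which event did Mira Holt speak at, according to the passage?",
        "context": "The biography notes that Mira Holt spoke at the 1990 "
                   "Clearwater Symposium.",
        "answer": "the 1990 Clearwater Symposium",
    },
    # Composite (complementary): needs the memorized birth year AND the passage.
    "composite": {
        "question": "How old was Mira Holt when she spoke at the event in the passage?",
        "context": "The biography notes that Mira Holt spoke at the 1990 "
                   "Clearwater Symposium.",
        "answer": "28",  # 1990 (from context) - 1962 (from internal knowledge)
    },
}
```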
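The two‑stage recipe itself can be wired up roughly as below. This is a minimal REINFORCE‑style sketch assuming a binary exact‑match reward; `sft`, `sample_answer`, `optimizer`, and the dataset variables are placeholder names, not the authors' code.

```python
import torch

def reinforce_loss(logprobs: torch.Tensor, reward: float) -> torch.Tensor:
    """Policy-gradient (REINFORCE) loss for one sampled answer.

    logprobs: per-token log-probabilities of the sampled answer under the policy.
    reward:   1.0 if the answer matches the gold answer, else 0.0
              (the binary reward described in the paper).
    """
    # Maximizing expected reward == minimizing -(reward * total log-probability).
    return -reward * logprobs.sum()

# Schematic of the two-stage pipeline (placeholder helpers, illustration only):
#
# 1) Supervised fine-tuning on the atomic tasks
#    model = sft(model, parametric_data)    # internal-knowledge questions
#    model = sft(model, contextual_data)    # passage-grounded questions
#
# 2) Policy-gradient RL on the composite task
#    for question, context, gold in composite_data:
#        answer, logprobs = sample_answer(model, question, context)
#        reward = float(answer.strip() == gold.strip())
#        reinforce_loss(logprobs, reward).backward()
#        optimizer.step(); optimizer.zero_grad()
```

In practice the RL stage would typically add the usual stabilizers (a baseline or group‑normalized advantages, and KL regularization toward the SFT checkpoint); they are omitted here for brevity.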
Results & Findings
| Training Setup | I.I.D. Accuracy | Composition Accuracy | Zero‑Shot Accuracy |
|---|---|---|---|
| SFT‑Only (Composite) | ~99% | ~45% | ~12% |
| Atomic‑SFT + RL | ~97% | ~84% | ~71% |
- SFT‑Only models excel when the test follows the training distribution but collapse when required to recombine skills in unseen ways.
- RL‑augmented models retain high in‑distribution performance while dramatically improving OOD generalization, especially in the hardest Zero‑Shot setting.
- Ablation experiments confirm that removing either atomic pre‑training step destroys the RL benefit, underscoring the atomic prerequisite.
Practical Implications
- Modular Skill Development: Developers can train language models on narrowly defined primitives (e.g., factual lookup, context extraction) before asking them to solve more intricate tasks, reducing the need for massive labeled composite datasets.
- Robust AI Assistants: For applications like personal assistants, customer support bots, or code‑generation tools that must blend internal knowledge with user‑provided context, the two‑stage pipeline promises better handling of novel request patterns.
- Cost‑Effective RL: Since RL is only applied after the model already knows the basics, the policy‑gradient phase converges faster and requires fewer environment interactions than end‑to‑end RL on the full task.
- Safety & Explainability: By forcing the model to rely on explicit atomic skills, it becomes easier to audit which knowledge source (internal vs. external) contributed to a decision, aiding transparency and debugging.
Limitations & Future Work
- Synthetic Domain: The experiments use a curated biography dataset; real‑world texts (e.g., news articles, codebases) may introduce noise and ambiguities not captured here.
- Scalability to Large Models: The study focuses on mid‑size language models; it remains open how the findings translate to billion‑parameter models with richer internal knowledge.
- Reward Design: The RL reward is binary (correct/incorrect). More nuanced reward shaping (e.g., partial credit for correct sub‑steps) could further improve learning efficiency; a toy sketch of such a shaped reward follows this list.
- Extension to Multi‑Modal Reasoning: Future work could explore whether the atomic‑to‑composite pipeline works when one of the primitives involves non‑textual modalities (images, tables, code).
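The reward‑shaping idea above can be illustrated with a toy example: score the intermediate atomic step separately from the final answer. The sub‑step check and the 0.3/0.7 weights are assumptions made for illustration, not the paper's design.

```python
def shaped_reward(pred_fact: str, gold_fact: str,
                  pred_answer: str, gold_answer: str) -> float:
    """Toy partial-credit reward: credit a correct intermediate step
    (e.g., extracting the right contextual fact) as well as the final answer.
    The 0.3/0.7 split is an illustrative choice, not taken from the paper."""
    step_ok = float(pred_fact.strip().lower() == gold_fact.strip().lower())
    final_ok = float(pred_answer.strip().lower() == gold_answer.strip().lower())
    return 0.3 * step_ok + 0.7 * final_ok

# Example: correct intermediate fact but wrong final answer earns partial credit.
print(shaped_reward("1990 Clearwater Symposium", "1990 Clearwater Symposium",
                    "27", "28"))  # -> 0.3
```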
Bottom line: By first teaching a model to master simple, well‑defined reasoning skills and then letting RL stitch those skills together, we can build systems that generalize far beyond the data they were explicitly trained on—opening a practical path toward truly compositional AI.
Authors
- Sitao Cheng
- Xunjian Yin
- Ruiwen Zhou
- Yuxuan Li
- Xinyi Wang
- Liangming Pan
- William Yang Wang
- Victor Zhong
Paper Information
- arXiv ID: 2512.01970v1
- Categories: cs.AI, cs.CL
- Published: December 1, 2025