[Paper] From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning

Published: December 1, 2025 at 01:27 PM EST
4 min read

Source: arXiv - 2512.01970v1

Overview

The paper investigates how reinforcement learning (RL) can turn simple reasoning abilities into more powerful, compositional ones. By breaking a complex “complementary reasoning” task into two atomic skills—Parametric Reasoning (using internal knowledge) and Contextual Reasoning (leveraging external information)—the authors show that RL can synthesize these primitives into a robust, generalizable strategy, but only after the model has first mastered each skill through supervised fine‑tuning (SFT).

Key Contributions

  • Atomic‑to‑Composite Framework: Formalizes complementary reasoning as the composition of two decoupled atomic tasks, enabling clean experimental control.
  • SFT Generalization Paradox: Demonstrates that models trained only on the composite task achieve near‑perfect in‑distribution scores yet completely fail on out‑of‑distribution (OOD) compositional tests.
  • RL as a Reasoning Synthesizer: Shows that RL does not merely amplify existing probabilities; it can learn to combine atomic skills into novel reasoning pathways.
  • Atomic Prerequisite Insight: Identifies a strict requirement: RL can only succeed if the base model has already mastered the individual atomic skills via SFT.
  • Scalable Training Pipeline: Proposes a two‑stage recipe—first SFT on atomic tasks, then RL on the composite task—that yields strong OOD generalization without explicit supervision on every possible composition.

Methodology

  1. Synthetic Biography Dataset: The authors generate a controlled set of human biographies where each entry contains both parametric facts (e.g., birth year) and contextual clues (e.g., a referenced event); a toy construction sketch appears after this list.
  2. Task Decomposition:
    • Parametric Reasoning – answer questions that can be solved using only the model’s internal knowledge base.
    • Contextual Reasoning – answer questions that require extracting and using information from the provided biography.
    • Composite (Complementary) Reasoning – answer questions that need both pieces of information together.
  3. Training Regimes:
    • SFT‑Only: Fine‑tune a language model on the composite task alone.
    • Atomic‑SFT + RL: First fine‑tune separately on the two atomic tasks, then apply RL (policy gradient) on the composite task, rewarding correct multi‑step reasoning (a minimal policy‑gradient sketch also follows the list).
  4. Generalization Benchmarks: Three difficulty tiers are evaluated:
    • I.I.D. – test data drawn from the same distribution as training.
    • Composition – novel combinations of known atomic patterns.
    • Zero‑Shot – entirely new relational structures never seen during training.
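
To make the decomposition concrete, here is a minimal Python sketch of how one synthetic biography example and its three question types could be constructed. The names, events, templates, and field layout are illustrative assumptions, not the authors' actual data‑generation code.

```python
import random

# Illustrative building blocks (hypothetical, not the paper's actual templates).
NAMES = ["Ada Moreau", "Liang Chen", "Tomás Rivera"]
EVENTS = {"the Apollo 11 landing": 1969, "the fall of the Berlin Wall": 1989}

def make_example(rng: random.Random) -> dict:
    """Build one synthetic biography with a parametric fact and a contextual clue."""
    name = rng.choice(NAMES)
    event, event_year = rng.choice(list(EVENTS.items()))
    age_at_event = rng.randint(20, 40)
    birth_year = event_year - age_at_event  # ground truth the model must infer

    biography = f"{name} was {age_at_event} years old at the time of {event}."

    return {
        "biography": biography,
        # Parametric reasoning: answerable from internal (memorized) knowledge alone,
        # e.g., after SFT has injected the fact "event -> year" into the model.
        "parametric_q": f"In which year did {event} take place?",
        "parametric_a": str(event_year),
        # Contextual reasoning: answerable only by reading the provided biography.
        "contextual_q": f"How old was {name} at the time of {event}?",
        "contextual_a": str(age_at_event),
        # Composite (complementary) reasoning: needs both pieces combined.
        "composite_q": f"In which year was {name} born?",
        "composite_a": str(birth_year),
    }

print(make_example(random.Random(0)))
```

In this toy construction, the composite answer (birth year) can only be derived by combining the parametric fact (the event's year) with the contextual clue (the subject's age at that event), mirroring the complementary setup described above.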
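The RL stage uses a policy‑gradient objective with a binary correctness reward. The toy below shows the shape of that update in isolation: a REINFORCE‑style step over a four‑token answer "vocabulary", where a single learnable logit vector stands in for a language model's parameters. This is a schematic of the reward and update structure, not the authors' training code.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for the composite task: the "policy" picks an answer token
# from a small vocabulary; reward is 1 if it matches the gold answer, else 0.
VOCAB = ["1947", "1948", "1949", "1950"]
GOLD = "1949"

logits = torch.zeros(len(VOCAB), requires_grad=True)  # stands in for model parameters
optimizer = torch.optim.Adam([logits], lr=0.1)

for step in range(200):
    probs = F.softmax(logits, dim=-1)
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()                               # sample an answer ("rollout")
    reward = 1.0 if VOCAB[action] == GOLD else 0.0       # binary correctness reward
    loss = -dist.log_prob(action) * reward               # REINFORCE: maximize E[R * log pi]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print({v: round(p, 3) for v, p in zip(VOCAB, F.softmax(logits, dim=-1).tolist())})
```

In the paper's setting the sampled "action" would, roughly speaking, be a full reasoning trace generated by the SFT‑initialized model, scored by the same binary correctness signal on the final answer.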

Results & Findings

| Training Setup | I.I.D. Accuracy | Composition Accuracy | Zero‑Shot Accuracy |
| --- | --- | --- | --- |
| SFT‑Only (Composite) | ~99% | ~45% | ~12% |
| Atomic‑SFT + RL | ~97% | 84% | 71% |
  • SFT‑Only models excel when the test follows the training distribution but collapse when required to recombine skills in unseen ways.
  • RL‑augmented models retain high in‑distribution performance while dramatically improving OOD generalization, especially in the hardest Zero‑Shot setting.
  • Ablation experiments confirm that removing either atomic pre‑training step destroys the RL benefit, underscoring the atomic prerequisite.

Practical Implications

  • Modular Skill Development: Developers can train language models on narrowly defined primitives (e.g., factual lookup, context extraction) before asking them to solve more intricate tasks, reducing the need for massive labeled composite datasets.
  • Robust AI Assistants: For applications like personal assistants, customer support bots, or code‑generation tools that must blend internal knowledge with user‑provided context, the two‑stage pipeline promises better handling of novel request patterns.
  • Cost‑Effective RL: Since RL is only applied after the model already knows the basics, the policy‑gradient phase converges faster and requires fewer environment interactions than end‑to‑end RL on the full task.
  • Safety & Explainability: By forcing the model to rely on explicit atomic skills, it becomes easier to audit which knowledge source (internal vs. external) contributed to a decision, aiding transparency and debugging.

Limitations & Future Work

  • Synthetic Domain: The experiments use a curated biography dataset; real‑world texts (e.g., news articles, codebases) may introduce noise and ambiguities not captured here.
  • Scalability to Large Models: The study focuses on mid‑size language models; it remains open how the findings translate to billion‑parameter models with richer internal knowledge.
  • Reward Design: The RL reward is binary (correct/incorrect). More nuanced reward shaping (e.g., partial credit for correct sub‑steps) could further improve learning efficiency; a speculative shaping sketch follows this list.
  • Extension to Multi‑Modal Reasoning: Future work could explore whether the atomic‑to‑composite pipeline works when one of the primitives involves non‑textual modalities (images, tables, code).
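
As one speculative illustration of such shaping (not from the paper), a reward could grant partial credit for each verifiable sub‑step, assuming the rollout exposes the intermediate parametric and contextual answers:

```python
def shaped_reward(pred_event_year: int, pred_age: int, pred_birth_year: int,
                  gold_event_year: int, gold_age: int, gold_birth_year: int) -> float:
    """Partial credit for each correct sub-step, plus a larger bonus for the final answer."""
    reward = 0.0
    if pred_event_year == gold_event_year:   # parametric sub-step (internal knowledge)
        reward += 0.25
    if pred_age == gold_age:                 # contextual sub-step (reading the biography)
        reward += 0.25
    if pred_birth_year == gold_birth_year:   # composite (complementary) answer
        reward += 0.5
    return reward
```

Full credit (1.0) still requires the correct final answer, so the shaped signal stays aligned with the original binary objective while giving gradient signal to partially correct traces.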

Bottom line: By first teaching a model to master simple, well‑defined reasoning skills and then letting RL stitch those skills together, we can build systems that generalize far beyond the data they were explicitly trained on—opening a practical path toward truly compositional AI.

Authors

  • Sitao Cheng
  • Xunjian Yin
  • Ruiwen Zhou
  • Yuxuan Li
  • Xinyi Wang
  • Liangming Pan
  • William Yang Wang
  • Victor Zhong

Paper Information

  • arXiv ID: 2512.01970v1
  • Categories: cs.AI, cs.CL
  • Published: December 1, 2025
