[Paper] Combinatorial Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination

Published: May 29, 2026 at 05:29 AM EDT

Source: arXiv - 2605.31058v1

Overview

The paper introduces Atomic Decomposition and Recombination (ADR), a method for automatically generating verifiable coding tasks for Reinforcement Learning with Verifiable Rewards (RLVR). ADR breaks code into small, reusable “atoms” and recombines them under controlled rules, which lets it produce a practically unbounded supply of novel, difficult, high‑quality tasks that probe the limits of what large language models (LLMs) can code.

Key Contributions

  • Atomic‑level synthesis: Formalizes code as a set of atomic elements (e.g., small functions, data structures, API calls) and defines a grammar for recombining them.
  • Scalable task generation: Produces virtually unlimited verifiable code problems without relying on hand‑crafted seed expansions.
  • Improved novelty & difficulty: Empirically shows that ADR‑generated tasks are more original, harder, and more diverse than those from prior heuristic methods.
  • Cross‑domain impact: Demonstrates consistent RLVR performance gains on downstream benchmarks covering algorithms, tool usage, and data‑science pipelines.
  • Open‑source pipeline: Releases the ADR framework and a benchmark suite, enabling the community to generate custom RLVR datasets.
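
To give a rough sense of why recombination scales, the sketch below counts ordered chains of distinct atoms drawn from a library. The library sizes and chain lengths are illustrative assumptions, not figures from the paper, and the function name `count_chains` is hypothetical. Real recombination rules and verifiability constraints would prune many of these chains, so the count is only an upper bound.

```python
import math


def count_chains(library_size: int, chain_length: int) -> int:
    """Upper bound on ordered chains of distinct atoms.

    This ignores any recombination rule or verifiability constraint,
    both of which would remove some of the counted chains.
    """
    return math.perm(library_size, chain_length)


if __name__ == "__main__":
    # Assumed library sizes, chosen only to show the growth rate.
    for n in (10, 100, 1000):
        print(n, [count_chains(n, k) for k in (1, 2, 3)])
```

Even a modest library yields far more candidate chains than a hand-written seed set, which is why filtering for quality and difficulty matters more than raw volume.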

Methodology

  1. Atomic Decomposition

    • The authors parse existing code snippets into a library of atoms: tiny, self‑contained units such as a loop pattern, a sorting routine, a pandas transformation, or a specific API call.
    • Each atom is annotated with its input‑output contract and a verifiability tag (e.g., can be checked with unit tests).
  2. Controlled Recombination

    • A set of recombination rules governs how atoms can be stitched together while preserving syntactic correctness and logical coherence.
    • Constraints ensure that the resulting program remains verifiable: a deterministic test harness can automatically evaluate correctness.
  3. Task Generation Pipeline

    • Randomly sample a seed atom, expand it using the recombination rules, and attach a generated test suite.
    • A difficulty estimator (based on static analysis, test coverage, and estimated solution length) filters out trivial or overly noisy tasks.
  4. RLVR Training Loop

    • The synthesized tasks feed into an RLVR loop where the LLM proposes code, the verifier runs the tests, and the resulting reward is used to update the model's policy.
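
The recombination, test-generation, and verification steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' pipeline: the `Atom` class, the four-atom library, the type-matching rule, and the binary reward are all hypothetical stand-ins for the paper's grammar, test harness, and reward design.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Atom:
    """Hypothetical atom: a small unit with an input-output contract."""
    name: str
    in_type: type
    out_type: type
    fn: Callable


# Toy atom library, assumed purely for illustration.
ATOMS = [
    Atom("sort_list", list, list, sorted),
    Atom("dedupe", list, list, lambda xs: list(dict.fromkeys(xs))),
    Atom("sum_list", list, int, sum),
    Atom("square", int, int, lambda x: x * x),
]


def recombine(rng: random.Random, length: int) -> List[Atom]:
    """Grow a chain from a random seed atom, appending only atoms whose
    input type matches the previous atom's output type."""
    chain = [rng.choice([a for a in ATOMS if a.in_type is list])]
    while len(chain) < length:
        options = [a for a in ATOMS if a.in_type is chain[-1].out_type]
        if not options:
            break
        chain.append(rng.choice(options))
    return chain


def run_chain(chain: List[Atom], value):
    """Reference solution: apply each atom in order."""
    for atom in chain:
        value = atom.fn(value)
    return value


def make_tests(chain: List[Atom], rng: random.Random,
               n_tests: int = 5) -> List[Tuple[list, object]]:
    """Derive input/expected-output pairs by running the reference chain."""
    tests = []
    for _ in range(n_tests):
        xs = [rng.randint(0, 9) for _ in range(rng.randint(1, 6))]
        tests.append((xs, run_chain(chain, list(xs))))
    return tests


def reward(candidate: Callable, tests: List[Tuple[list, object]]) -> float:
    """Binary verifiable reward: 1.0 only if every test passes."""
    try:
        return float(all(candidate(list(x)) == y for x, y in tests))
    except Exception:
        return 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    chain = recombine(rng, length=3)
    tests = make_tests(chain, rng)
    print(" -> ".join(a.name for a in chain))
    # The reference chain itself should earn full reward.
    print(reward(lambda xs: run_chain(chain, xs), tests))
```

Matching output types to input types is the simplest possible coherence constraint; a real system would presumably also need semantic checks so that the recombined program describes a meaningful task.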

The whole process is fully automated, requiring only an initial corpus of seed code (e.g., open‑source repositories) to bootstrap the atom library.
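
As a rough illustration of bootstrapping an atom library from seed code, the sketch below uses Python's standard `ast` module to treat each top-level function as a candidate atom. The seed snippet, the `extract_atoms` helper, and the choice of "one function = one atom" are assumptions made for this example; the paper's actual extraction granularity and annotations may differ.

```python
import ast
from typing import Dict, List

# Assumed seed code standing in for an open-source repository file.
SEED_CODE = '''
import os

def clamp(x, lo, hi):
    return max(lo, min(x, hi))

def mean(xs):
    return sum(xs) / len(xs)

class Unused:
    pass
'''


def extract_atoms(source: str) -> List[Dict[str, object]]:
    """Collect top-level functions as candidate atoms.

    Each record holds a crude contract: the function name, its
    positional parameter names, and its source text.
    """
    tree = ast.parse(source)
    atoms = []
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            atoms.append({
                "name": node.name,
                "params": [a.arg for a in node.args.args],
                "source": ast.get_source_segment(source, node),
            })
    return atoms


if __name__ == "__main__":
    for atom in extract_atoms(SEED_CODE):
        print(atom["name"], atom["params"])
```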

Results & Findings

Metric                                              | ADR vs. Heuristic Baselines
----------------------------------------------------|----------------------------
Originality (unique atom combos)                    | +42%
Average difficulty (test‑failure rate)              | +0.27 (higher = harder)
Diversity (semantic variety)                        | +35%
Test quality (pass‑rate of ground‑truth solutions)  | 96% (vs. 88%)
RLVR downstream gain                                | +4.8% on algorithmic benchmark, +5.3% on tool‑usage tasks, +6.1% on data‑science suite
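
Two of the metrics above are simple rates over test outcomes. The sketch below shows one plausible way to compute them; the exact definitions used in the paper are not given here, so the helper names and the toy outcome lists are assumptions for illustration only.

```python
from typing import Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of runs that passed; used here as a proxy for test quality
    when the runs are ground-truth solutions."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def failure_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of runs that failed; used here as a proxy for difficulty
    when the runs are model attempts."""
    return 1.0 - pass_rate(outcomes)


if __name__ == "__main__":
    # Toy outcomes, not data from the paper.
    ground_truth_runs = [True, True, True, False]
    model_attempts = [True, False, False, False]
    print(pass_rate(ground_truth_runs), failure_rate(model_attempts))
```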

In plain terms, models fine‑tuned with ADR‑generated tasks consistently outperformed those trained on previously used synthetic data, achieving noticeable lifts across a range of real‑world coding scenarios.

Practical Implications

  • Faster LLM skill scaling: Developers can now generate as many high‑quality coding challenges as needed, letting RLVR keep pace with ever‑larger models.
  • Custom curriculum creation: Teams can tailor atom libraries to their tech stack (e.g., AWS SDK, TensorFlow) and automatically produce domain‑specific RLVR tasks.
  • Better automated code reviewers: RLVR models trained with ADR can more reliably suggest bug‑free patches, because they have been exposed to a broader spectrum of verifiable patterns.
  • Reduced reliance on human‑written benchmarks: Companies can bootstrap internal coding‑assessment pipelines without manually curating test cases.
  • Open‑source ecosystem boost: The released ADR toolkit can become a community hub for sharing atom libraries and benchmark suites, accelerating research on code‑centric RL.

Limitations & Future Work

  • Atom granularity trade‑off: Very fine‑grained atoms increase combinatorial possibilities but may produce unrealistic code fragments; coarse atoms limit novelty. Finding the sweet spot remains an open problem.
  • Verification bottleneck: While tests are automatically generated, some complex tasks (e.g., performance‑critical code) still require human‑crafted validators.
  • Domain transfer: The current atom extraction focuses on Python; extending ADR to statically typed languages (Java, Rust) will need language‑specific parsing and type‑checking pipelines.
  • Curriculum scheduling: The paper treats all generated tasks equally; future work could explore adaptive curricula that gradually increase difficulty based on model performance.
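
The adaptive curriculum mentioned as future work could take many forms. The sketch below is one hypothetical scheduler, not something the paper implements: it assumes each task carries a difficulty score in [0, 1] and picks the tasks closest to a target that rises with the model's recent pass rate. The `Task` class and `next_batch` function are invented for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    difficulty: float  # assumed scale: 0.0 (trivial) to 1.0 (hardest)


def next_batch(tasks: List[Task], recent_pass_rate: float,
               batch_size: int = 2) -> List[Task]:
    """Pick the tasks whose difficulty is closest to the target.

    The target is simply the recent pass rate clamped to [0, 1], so a
    model that passes more often is served harder tasks.
    """
    target = min(1.0, max(0.0, recent_pass_rate))
    return sorted(tasks, key=lambda t: abs(t.difficulty - target))[:batch_size]


if __name__ == "__main__":
    pool = [Task("a", 0.1), Task("b", 0.4), Task("c", 0.7), Task("d", 0.9)]
    print([t.name for t in next_batch(pool, recent_pass_rate=0.2)])
    print([t.name for t in next_batch(pool, recent_pass_rate=0.9)])
```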

Overall, ADR marks a significant step toward scalable, high‑impact RLVR training, opening the door for LLMs that can write, debug, and reason about code with far fewer human‑curated examples.

Authors

  • Jiasheng Zheng
  • Boxi Cao
  • Boxi Yu
  • Yuzhong Zhang
  • Jialun Cao
  • Yaojie Lu
  • Hongyu Lin
  • Xianpei Han
  • Le Sun

Paper Information

  • arXiv ID: 2605.31058v1
  • Categories: cs.CL, cs.SE
  • Published: May 29, 2026