[Paper] ReSyn: Autonomously Scaling Synthetic Environments for Reasoning Models
Source: arXiv - 2602.20117v1
Overview
The paper introduces ReSyn, a new pipeline that automatically creates large‑scale synthetic reasoning environments paired with verifiers. By training language models with reinforcement learning on these environments, the authors demonstrate sizable improvements on a range of reasoning benchmarks, including a 27 % relative boost on the notoriously hard BBEH math suite.
Key Contributions
- ReSyn pipeline: An end‑to‑end system that generates diverse, self‑verifiable reasoning tasks (constraint satisfaction, algorithmic puzzles, spatial reasoning, etc.) without hand‑written solutions.
- Verifier‑centric supervision: Shifts the training signal from “correct answer” to “verifiable reward,” making data creation far cheaper and more scalable.
- Empirical validation: A Qwen2.5‑7B‑Instruct model fine‑tuned with RL on ReSyn outperforms strong baselines across standard reasoning benchmarks and shows strong out‑of‑domain generalisation.
- Ablation insights: Demonstrates that both the verifier‑based reward and the breadth of task families are essential for the observed gains.
Methodology
- Environment Library – The authors hand‑craft a modest set of procedural generators that can instantiate thousands of concrete problem instances on the fly (e.g., generate a random Sudoku, a graph‑coloring constraint set, or a 2‑D navigation puzzle).
- Verifier Construction – For each environment, a lightweight program checks whether a candidate solution satisfies the constraints, returning a binary reward (1 = valid, 0 = invalid). This replaces the need for human‑written answer keys.
- RL Training Loop – An LLM (Qwen2.5‑7B‑Instruct) proposes solutions; the verifier evaluates them, and a reinforcement‑learning algorithm (proximal policy optimisation, PPO) updates the model to maximise the verifier reward.
- Curriculum & Diversity – Tasks are sampled uniformly across environment types, ensuring the model sees a wide variety of reasoning patterns during training.
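To make the generator/verifier pairing concrete, here is a minimal sketch for one environment family the paper mentions, graph colouring. The function names and instance schema are our own illustration, not the paper's code; the key property is that the verifier returns the binary reward (1 = valid, 0 = invalid) described above without needing an answer key.

```python
import random

def gen_coloring_instance(n_nodes=8, n_edges=12, n_colors=3, seed=None):
    """Procedurally sample a random graph-colouring instance (illustrative schema)."""
    rng = random.Random(seed)
    edges = set()
    while len(edges) < n_edges:
        u, v = rng.sample(range(n_nodes), 2)
        edges.add((min(u, v), max(u, v)))  # store undirected edges canonically
    return {"n_nodes": n_nodes, "n_colors": n_colors, "edges": sorted(edges)}

def verify_coloring(instance, assignment):
    """Binary reward: 1 iff the candidate colouring satisfies every constraint."""
    if len(assignment) != instance["n_nodes"]:
        return 0  # malformed candidate
    if any(c not in range(instance["n_colors"]) for c in assignment):
        return 0  # colour out of range
    ok = all(assignment[u] != assignment[v] for u, v in instance["edges"])
    return 1 if ok else 0
```

Because the verifier only checks constraints, it stays cheap even when *solving* the instance is hard, which is what makes this supervision signal scalable.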
The whole pipeline runs autonomously: new instances are generated on demand, verified, and fed back into the RL optimizer, enabling massive data throughput without manual labeling.
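The generate‑verify‑reward cycle can be sketched as a single loop; this is a simplification under our own naming, showing only how verifier rewards are collected, with the PPO update itself abstracted away behind the `policy` callable (in the paper, the LLM being trained).

```python
def rl_step(policy, generator, verifier, batch_size=4):
    """One simplified iteration of the autonomous pipeline:
    generate fresh instances, let the policy propose solutions,
    score them with the verifier, and report the mean reward."""
    rewards = []
    for _ in range(batch_size):
        instance = generator()            # new instance on demand, no labels
        candidate = policy(instance)      # LLM proposes a solution
        rewards.append(verifier(instance, candidate))  # binary reward
    return sum(rewards) / batch_size      # signal fed to the RL optimizer
```

Because instances are generated on demand, the loop never exhausts its data, which is the property the authors rely on for massive throughput.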
Results & Findings
| Benchmark (accuracy) | Baseline (no RL) | RL on ReSyn | Relative Gain |
|---|---|---|---|
| BBEH (hard math) | 0.42 | 0.53 | +27 % |
| MATH | 0.58 | 0.64 | +10 % |
| ARC‑Easy | 0.71 | 0.77 | +8 % |
| Spatial‑Reasoning Suite | 0.66 | 0.73 | +11 % |
- Verifier‑only supervision already yields a 5–8 % lift over standard supervised fine‑tuning, confirming that reward‑driven learning is effective even without explicit answer annotations.
- Task diversity matters: removing half of the environment families drops performance by ~4 % on average, indicating that exposure to varied reasoning patterns is crucial for generalisation.
- The model retains its language generation quality (BLEU, perplexity) while gaining reasoning strength, suggesting that reinforcement learning with verifiable rewards (RLVR) does not sacrifice fluency.
Practical Implications
- Cheaper data pipelines – Companies can generate endless training data for reasoning‑heavy applications (e.g., automated theorem proving, constraint‑based scheduling, game AI) without hiring annotators.
- Rapid prototyping of new domains – Adding a new procedural generator plus a verifier is all that’s needed to extend the training set to a novel problem space (e.g., network routing puzzles).
- Improved AI assistants – Deploying models trained with ReSyn‑style RLVR can lead to more reliable step‑by‑step problem solving in code assistants, math tutoring bots, and decision‑support tools.
- Safety & interpretability – Verifier feedback is deterministic and auditable, offering a clearer signal for alignment researchers who need to know why a model’s answer is correct.
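The "generator plus verifier is all you need" extension point suggests a simple plugin pattern; the registry below is our own hypothetical sketch of how such a training set could be extended and sampled uniformly across families, as the paper's curriculum does.

```python
import random

ENVIRONMENTS = {}  # hypothetical registry: name -> (generator, verifier)

def register_env(name, generator, verifier):
    """Extend the training set: a new domain only needs these two callables."""
    ENVIRONMENTS[name] = (generator, verifier)

def sample_task(rng):
    """Sample uniformly across environment families, mirroring the
    paper's curriculum, and hand back the matching verifier."""
    name = rng.choice(sorted(ENVIRONMENTS))
    generator, verifier = ENVIRONMENTS[name]
    return name, generator(), verifier
```

A new problem space (say, network routing puzzles) then slots in with one `register_env` call, no annotation effort required.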
Limitations & Future Work
- Verifier design overhead – While cheaper than full solution annotation, each new environment still requires a correct, efficient verifier, which may be non‑trivial for highly complex domains.
- Scalability to larger models – Experiments were limited to a 7 B‑parameter LLM; it remains to be seen how the approach scales to 70 B+ models where RL stability can be more fragile.
- Reward sparsity – Some environments produce very few valid solutions, leading to sparse rewards; future work could explore curriculum learning or shaped rewards to mitigate this.
- Generalisation bounds – The paper shows strong out‑of‑domain performance on benchmark suites, but real‑world tasks with noisy or ambiguous constraints may still challenge verifier‑based training.
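One concrete form the shaped rewards suggested above could take (our suggestion, not a method from the paper) is partial credit: score a candidate by the fraction of constraints it satisfies rather than all‑or‑nothing, densifying the signal in environments where fully valid solutions are rare.

```python
def shaped_coloring_reward(instance, assignment):
    """Dense alternative to a binary verifier for graph colouring:
    reward = fraction of edge constraints the candidate satisfies."""
    edges = instance["edges"]
    if not edges or len(assignment) != instance["n_nodes"]:
        return 0.0  # malformed candidate or degenerate instance
    satisfied = sum(assignment[u] != assignment[v] for u, v in edges)
    return satisfied / len(edges)
```

The trade‑off is that shaped rewards are no longer a strict validity check, so they would need to be annealed back toward the binary signal to preserve the auditability the authors highlight.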
ReSyn opens a promising path toward cost‑effective, scalable reasoning training for language models, and its blend of procedural generation and verifier‑driven reinforcement learning is poised to become a staple in the next generation of AI development pipelines.
Authors
- Andre He
- Nathaniel Weir
- Kaj Bostrom
- Allen Nie
- Darion Cassel
- Sam Bayless
- Huzefa Rangwala
Paper Information
- arXiv ID: 2602.20117v1
- Categories: cs.AI, cs.LG
- Published: February 23, 2026