[Paper] Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training

Published: March 2, 2026 at 01:59 PM EST
5 min read
Source: arXiv


Overview

The paper presents Reasoning Core, a new open‑source suite that can generate massive amounts of verifiable symbolic reasoning data on the fly. By procedurally creating tasks such as planning problems, first‑order logic statements, grammar parsing, Bayesian‑network causality, and systems of equations, the authors give language‑model researchers a way to pre‑train or fine‑tune models on data that is both exactly checkable and continuously scalable. Their experiments show that sprinkling this data into a model’s pre‑training mix boosts downstream reasoning abilities without hurting (and sometimes even improving) raw language‑model performance.

Key Contributions

  • Procedural generator suite covering five core formal domains (PDDL planning, FOL with equality, CFG parsing, Bayesian‑network causal reasoning, and linear equation solving).
  • External solvers attached to each generator for automatic, rigorous verification of every sample.
  • Difficulty‑curriculum control that lets users dial the complexity of generated instances on a smooth scale.
  • Optional reasoning traces (step‑by‑step solver outputs) that can be used for supervised learning from the earliest pre‑training stages.
  • Unified API that also supplies verifiable reward functions for reinforcement‑learning experiments.
  • Empirical evidence that mixing Reasoning Core data into large‑scale pre‑training improves zero‑shot reasoning on benchmark tasks while preserving language‑model perplexity.
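To make the unified-API idea concrete, here is a minimal sketch of what such an interface could look like. The `Task` dataclass and `make_reward` helper are illustrative assumptions, not the suite's actual names or signatures:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Task:
    prompt: str                        # problem statement shown to the model
    gold: str                          # solver-verified ground-truth answer
    trace: Optional[str]               # optional step-by-step solver trace
    reward: Callable[[str], float]     # deterministic RL reward function

def make_reward(gold: str) -> Callable[[str], float]:
    """Exact-match reward: 1.0 if the model's answer equals the solver's."""
    return lambda answer: 1.0 if answer.strip() == gold else 0.0
```

The same `Task` object then serves both supervised training (use `prompt`, `gold`, and optionally `trace` as next-token targets) and RL experiments (call `reward` on the model's output).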

Methodology

  1. Task Generation – For each formal domain, a lightweight procedural engine randomly instantiates problem parameters (e.g., objects, predicates, grammar rules, network topology). The randomness is seeded so that the same “difficulty level” yields comparable challenge across runs.
  2. Solver Verification – An off‑the‑shelf exact solver (e.g., a PDDL planner, a first‑order theorem prover, a CFG parser, a Bayesian inference engine, a linear‑system solver) runs on the generated instance. If the solver finds a solution, the instance is kept; otherwise it is discarded, guaranteeing that every retained example is ground‑truth correct.
  3. Trace Extraction (optional) – The solver can emit a detailed proof or execution trace (e.g., plan steps, resolution steps, parse tree, variable assignments). These traces are stored alongside the raw problem statement, providing a supervised signal.
  4. Curriculum Scheduling – Difficulty is encoded as a numeric knob (e.g., number of objects, depth of logical formulas, size of the Bayesian network). Researchers can sample uniformly, bias toward harder examples, or follow a curriculum that gradually raises the difficulty as training progresses.
  5. Integration with Language‑Model Training – Generated (problem, solution) pairs are tokenized and mixed into the usual next‑token prediction objective. For RL‑style experiments, the suite also returns a deterministic reward (e.g., 1 if the model’s answer matches the solver’s, 0 otherwise).
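The pipeline above can be sketched end to end for the linear-equation domain. This is a minimal illustration under assumed names (`generate_task`, `solve_exact`, `reward`), not the paper's implementation; exact Gaussian elimination over rationals stands in for the external solver:

```python
import random
from fractions import Fraction

def generate_task(difficulty, seed=None):
    """Step 1: procedurally build a difficulty-by-difficulty system Ax = b
    with a known integer solution, seeded for reproducibility."""
    rng = random.Random(seed)
    n = difficulty
    x = [rng.randint(-5, 5) for _ in range(n)]                      # hidden solution
    A = [[rng.randint(-9, 9) for _ in range(n)] for _ in range(n)]
    b = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]   # consistent by construction
    return A, b, x

def solve_exact(A, b):
    """Step 2: exact rational Gaussian elimination (the 'external solver').
    Returns None for singular systems, which the pipeline would discard."""
    n = len(b)
    M = [[Fraction(v) for v in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def reward(model_answer, A, b):
    """Step 5: deterministic RL reward — 1 if the answer solves the system, else 0."""
    gold = solve_exact(A, b)
    return int(gold is not None and [Fraction(v) for v in model_answer] == gold)
```

The difficulty knob of step 4 is simply the `difficulty` argument here (system size); in the other domains it maps to quantities such as formula depth or network size.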

Results & Findings

| Experiment | Setup | Main Metric | Outcome |
|---|---|---|---|
| Pre‑training mix (Reasoning Core + standard web text) | 10 B‑token model, 5 % Reasoning Core data | Zero‑shot logical reasoning (MATH, ProofWriter) | +8–12 % absolute accuracy over baseline |
| Language‑model quality | Same mix; perplexity on WikiText‑103 | Perplexity | Slightly lower (better), ≈ 0.3 % improvement |
| Curriculum vs. uniform sampling | Fixed vs. gradually increasing difficulty | Reasoning benchmark scores | Curriculum yields ~3 % higher accuracy on hardest tasks |
| Trace‑supervised pre‑training | Solver traces included as auxiliary targets | Downstream reasoning | Additional 2–4 % boost on proof‑generation tasks |
| Zero‑shot on frontier model (GPT‑5) | Prompted with unseen Reasoning Core tasks | Success rate | Only ~30 % of tasks solved, confirming difficulty |

Overall, the data does not degrade the model’s ability to generate fluent text, and it significantly lifts performance on symbolic reasoning benchmarks that are otherwise hard for pure language‑model pre‑training.

Practical Implications

  • Better reasoning for downstream tools – Developers building code assistants, automated theorem provers, or planning bots can now pre‑train on data that mirrors the logical structure of their target tasks, leading to more reliable outputs.
  • Curriculum‑driven fine‑tuning – The difficulty knob enables a “progressive overload” strategy: start with simple puzzles, then gradually introduce harder ones, much like human learning. This can reduce the number of fine‑tuning steps needed to reach a target accuracy.
  • Reinforcement‑learning environments – Because each instance comes with a deterministic reward, the suite can serve as a sandbox for RL research on symbolic reasoning (e.g., teaching agents to plan or solve equations).
  • Open‑source and extensible – The MIT‑licensed code can be dropped into existing data pipelines, and the modular design makes it straightforward to add new domains (e.g., graph‑theoretic problems, type‑theory exercises).
  • Benchmark generation – Researchers can generate custom, verifiable test sets on demand, eliminating the need to manually curate or verify symbolic datasets.
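The curriculum-driven "progressive overload" strategy described above can be sketched as a simple difficulty schedule. The linear ramp and the function names below are illustrative assumptions, not the suite's API:

```python
import random

def curriculum_difficulty(step, total_steps, d_min=1, d_max=10):
    """Linear 'progressive overload': scheduled difficulty rises from
    d_min to d_max as training progresses."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return d_min + round(frac * (d_max - d_min))

def sample_difficulty(step, total_steps, rng, jitter=1, d_min=1, d_max=10):
    """Sample near the scheduled difficulty with small jitter, so each
    batch mixes slightly easier and slightly harder instances."""
    center = curriculum_difficulty(step, total_steps, d_min, d_max)
    return min(max(center + rng.randint(-jitter, jitter), d_min), d_max)
```

In practice the sampled difficulty would be passed straight to the task generators, and an adaptive variant could raise the schedule only once the model's accuracy at the current level crosses a threshold.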

Limitations & Future Work

  • Solver bottleneck – Generating and verifying large volumes of data is compute‑intensive; scaling to trillions of tokens may require distributed solver farms or approximate verification.
  • Domain coverage – While the five core domains are broad, many real‑world reasoning tasks (e.g., probabilistic programming, higher‑order logic) are not yet represented.
  • Transfer gap – The observed gains, though consistent, are modest for very large models (e.g., GPT‑5), suggesting diminishing returns as model capacity grows.
  • Human‑readability – Some generated instances (especially large Bayesian networks) can be unwieldy for humans to inspect, limiting manual debugging.

Future work could explore adaptive difficulty scheduling driven by model performance, integrate approximate solvers for faster data generation, and expand the suite to cover domain‑specific reasoning (e.g., security policy analysis, hardware verification).

Reasoning Core opens a practical pathway for developers to inject rigorously verified symbolic reasoning into the massive pre‑training pipelines that power today’s language models, bridging the gap between raw text fluency and logical competence.

Authors

  • Valentin Lacombe
  • Valentin Quesnel
  • Damien Sileo

Paper Information

  • arXiv ID: 2603.02208v1
  • Categories: cs.CL
  • Published: March 2, 2026
  • PDF: Download PDF