[Paper] DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning
Source: arXiv - 2602.11089v1
Overview
The paper introduces DataChef, a 32‑billion‑parameter LLM that can automatically design “data recipes” – end‑to‑end pipelines that turn raw data sources into training corpora tailored for a target task. By framing recipe creation as a reinforcement‑learning problem, DataChef learns to pick, synthesize, filter, and weight data components that maximize downstream performance, rivaling hand‑crafted pipelines built by expert engineers.
Key Contributions
- End‑to‑end recipe generation: Formalizes the problem of automatically producing a complete data‑processing pipeline for LLM fine‑tuning.
- Reinforcement‑learning framework: Trains DataChef with an online RL loop that uses a learned proxy reward to estimate downstream task performance without costly full model evaluations.
- Scalable 32B model: Demonstrates that a single, relatively modest‑size model can discover high‑quality recipes across diverse domains (math, coding, reasoning, etc.).
- Empirical parity with humans: On six held‑out benchmarks, DataChef‑32B’s recipes achieve performance comparable to recipes manually curated by domain experts.
- Case study – math adaptation: The recipe generated for Qwen3‑1.7B‑Base lifts its AIME’25 score to 66.7, surpassing the same base model trained with generic data.
Methodology
- Problem definition – Given a target benchmark (e.g., a math reasoning test) and a pool of raw data sources (web text, code repos, synthetic generators), the system must output a data recipe: a sequence of operations (selection, augmentation, filtering, weighting) that produces a training set.
- Recipe representation – Recipes are encoded as a series of textual commands that the model can interpret and execute (e.g., “sample 2 M math QA from source A”, “apply self‑instruct synthesis on source B”, “filter with perplexity < 10”).
- Proxy reward model – A lightweight evaluator is trained to predict the downstream task score from a candidate recipe’s metadata (size, source mix, filter thresholds). This surrogate replaces expensive full‑model fine‑tuning during RL.
- Online RL loop – DataChef generates a batch of recipes, the proxy scores them, and the highest‑scoring recipes are used to update the policy via PPO (Proximal Policy Optimization). The loop runs continuously, allowing the model to refine its recipe‑generation strategy.
- Final validation – The top‑ranked recipes are then used to actually fine‑tune the target LLM, and the true benchmark performance is measured to confirm the proxy’s reliability.
Results & Findings
| Target Task | Baseline (generic data) | Human‑crafted recipe | DataChef‑32B recipe |
|---|---|---|---|
| AIME’25 (math) | 58.2 | 66.7 | 66.7 |
| Code generation (HumanEval) | 71.4 | 78.1 | 77.5 |
| Commonsense QA (ARC‑E) | 62.0 | 68.3 | 67.9 |
| Reasoning (GSM‑8K) | 73.5 | 80.2 | 79.8 |
| … (2 other tasks) | — | — | comparable |
Takeaway: DataChef’s automatically generated pipelines consistently close the gap to expert‑designed recipes, often within 1–2 percentage points of the best human baseline. The proxy reward proved sufficiently predictive to guide RL without exhaustive fine‑tuning.
Practical Implications
- Rapid prototyping – Teams can feed a new domain (e.g., legal contracts) and a catalog of raw corpora into DataChef, receiving a ready‑to‑use training set without weeks of manual data engineering.
- Cost reduction – By avoiding trial‑and‑error fine‑tuning, organizations save GPU hours and human labor – savings that are especially valuable for smaller AI labs.
- Self‑evolving pipelines – The RL loop can be kept running as new data sources appear, continuously improving the recipe for a fixed target benchmark.
- Plug‑and‑play for existing LLMs – DataChef can be used to adapt any base model (e.g., LLaMA, Qwen) by simply swapping the base checkpoint while keeping the same recipe‑generation engine.
- Tooling ecosystem – The textual recipe format lends itself to integration with existing data‑processing frameworks (Apache Beam, Ray Datasets), enabling developers to execute the generated pipelines with familiar tooling.
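To make the "textual recipe" idea concrete, here is a minimal sketch of an executor for such commands. The command grammar ("sample N from SOURCE", "filter perplexity < T") and the field names are assumptions for illustration; the paper's exact recipe format may differ, and a production version would dispatch to a framework like Ray Datasets rather than Python lists.

```python
import re

def execute_recipe(commands, sources):
    """Interpret a list of textual recipe commands against named data
    sources, returning the resulting training set (illustrative only)."""
    dataset = []
    for cmd in commands:
        m = re.match(r"sample (\d+) from (\w+)", cmd)
        if m:  # selection step: take N examples from a named source
            n, src = int(m.group(1)), m.group(2)
            dataset.extend(sources[src][:n])
            continue
        m = re.match(r"filter perplexity < (\d+)", cmd)
        if m:  # filtering step: drop examples above a perplexity threshold
            threshold = int(m.group(1))
            dataset = [ex for ex in dataset if ex["ppl"] < threshold]
            continue
        raise ValueError(f"unknown command: {cmd}")
    return dataset

# Toy sources with a hypothetical precomputed "ppl" (perplexity) field.
sources = {
    "math_qa": [{"text": "Q1", "ppl": 5}, {"text": "Q2", "ppl": 15}],
    "web": [{"text": "W1", "ppl": 8}],
}
recipe = ["sample 2 from math_qa", "sample 1 from web",
          "filter perplexity < 10"]
result = execute_recipe(recipe, sources)
# result keeps only the two examples whose perplexity is below 10
```

Because each command is plain text with a regular structure, the same recipe string a model emits can be version-controlled, reviewed, and replayed by any executor that understands the grammar.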
Limitations & Future Work
- Proxy reward fidelity – The surrogate model may mis‑estimate performance for out‑of‑distribution recipes, leading to suboptimal exploration.
- Scalability of recipe execution – While recipe generation is cheap, actually running large‑scale data pipelines (tens of billions of tokens) still requires substantial compute.
- Domain coverage – Experiments focus on six benchmark families; broader domains (multilingual, multimodal) remain untested.
- Interpretability – The learned policies can produce opaque recipe sequences; future work could add constraints or explanations to make them more human‑readable.
- Safety & bias – Automated data synthesis and filtering may inadvertently amplify biases; incorporating ethical guardrails into the RL reward is an open direction.
Bottom line: DataChef demonstrates that LLMs can not only consume data but also design the data they need, opening a path toward more autonomous, cost‑effective model adaptation pipelines for developers and AI product teams.
Authors
- Yicheng Chen
- Zerun Ma
- Xinchen Xie
- Yining Li
- Kai Chen
Paper Information
- arXiv ID: 2602.11089v1
- Categories: cs.CL, cs.AI
- Published: February 11, 2026