[Paper] DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
Source: arXiv - 2602.24288v1
Overview
The paper introduces DARE‑bench, a large‑scale benchmark that evaluates how well large language models (LLMs) follow data‑science instructions and faithfully execute multi‑step modeling pipelines. By providing verifiable ground‑truth outcomes for every task, DARE‑bench fills a critical gap in existing evaluations that often rely on subjective human or model judges.
Key Contributions
- Process‑aware benchmark: 6,300 real‑world Kaggle tasks covering data cleaning, feature engineering, model selection, training, and evaluation, each with a deterministic ground‑truth answer.
- Unified training & test splits: The same dataset can be used for supervised fine‑tuning, reinforcement learning, and zero‑shot evaluation, enabling end‑to‑end research cycles.
- Objective evaluation metrics: Accuracy, F1, and regression‑specific scores are computed against the known correct model or metric, eliminating the need for noisy human ratings.
- Empirical baseline study: State‑of‑the‑art LLMs (e.g., GPT‑4‑mini, Claude‑2, LLaMA‑2) are benchmarked, revealing a substantial performance gap on modeling tasks.
- Demonstrated training gains: Fine‑tuning on DARE‑bench data yields up to a 1.8× accuracy improvement for Qwen‑3‑32B (SFT) and over 8× for Qwen‑3‑4B (RLHF), demonstrating the benchmark’s dual role as an evaluation suite and a high‑quality training resource.
Methodology
- Task collection – The authors mined 6,300 Kaggle competition notebooks, extracting the full end‑to‑end data‑science workflow (data load → preprocessing → model training → metric computation).
- Ground‑truth generation – Each notebook is executed in a sandboxed environment; the final evaluation metric (e.g., accuracy, RMSE) is recorded as the gold answer.
- Prompt design – For every task, a concise natural‑language instruction is crafted (e.g., “Train a gradient‑boosted tree on the provided CSV and report the ROC‑AUC”). The prompt includes any necessary data schema but never the solution code.
- Model interaction – LLMs receive the instruction and must output a complete, runnable Python script (or a series of API calls) that reproduces the pipeline.
- Verification – The generated script is executed; the resulting metric is compared to the ground truth. A match within a small tolerance counts as a correct answer.
- Training experiments – The same task set is split five‑fold into a training corpus for supervised fine‑tuning (SFT); for RLHF, the reward signal is derived directly from the ground‑truth metric.
Results & Findings
| Model (zero‑shot) | Avg. Accuracy (classification) | Avg. RMSE (regression) |
|---|---|---|
| GPT‑4‑mini | 38 % | 1.21 |
| Claude‑2 | 34 % | 1.34 |
| LLaMA‑2‑70B | 31 % | 1.47 |
- Modeling tasks are hardest: Even the strongest LLMs drop below 40 % on tasks that require selecting hyper‑parameters, feature engineering, or handling data leakage.
- Fine‑tuning pays off: After SFT on DARE‑bench, Qwen‑3‑32B’s classification accuracy jumps from 22 % to 40 % (≈1.8×).
- RLHF yields dramatic gains: Qwen‑3‑4B improves from 12 % to >95 % on a subset of tasks after reinforcement learning, an 8× relative boost.
- Generalization: Models fine‑tuned on DARE‑bench retain improvements on unseen Kaggle‑style tasks, indicating that the benchmark captures transferable data‑science reasoning rather than memorizing specific notebooks.
Practical Implications
- Tooling for data‑science assistants – Developers building “AI‑powered notebooks” or “LLM copilots” can use DARE‑bench to validate that their agents not only understand natural‑language prompts but also generate executable, correct pipelines.
- Curriculum for fine‑tuning – The benchmark’s large, labeled training split serves as a ready‑made curriculum for domain‑specific SFT or RLHF, dramatically reducing the data‑collection effort for companies wanting robust data‑science bots.
- Automated code review – Because every generated script is run against a sandbox, DARE‑bench can be repurposed as a regression suite for continuous integration of LLM‑based code generators.
- Benchmarking standards – The process‑aware, ground‑truth‑driven evaluation model can be extended to other multi‑step domains (e.g., DevOps automation, scientific computing), encouraging more reproducible LLM assessments.
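The regression‑suite idea above can be sketched as a small harness. This is a hypothetical illustration: the task dictionary format and the `generate_and_run` callable (prompt in, reported metric out) are assumptions, not an interface defined by the paper.

```python
import math

def run_regression_suite(tasks, generate_and_run, rel_tol=1e-3):
    """Gate a CI build on DARE-bench-style tasks.

    tasks: iterable of dicts with 'prompt' and 'gold_metric' keys (assumed format).
    generate_and_run: callable mapping a prompt to the metric reported by the
        generated-and-executed script, or None if generation/execution failed.
    Returns the prompts of all failing tasks; an empty list means the build passes.
    """
    failures = []
    for task in tasks:
        metric = generate_and_run(task["prompt"])
        ok = metric is not None and math.isclose(
            metric, task["gold_metric"], rel_tol=rel_tol
        )
        if not ok:
            failures.append(task["prompt"])
    return failures
```

In a CI pipeline, a nonempty failure list would fail the build, flagging any change to the LLM code generator that degrades pipeline correctness.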
Limitations & Future Work
- Scope of data domains – All tasks stem from Kaggle competitions, which tend to be well‑structured and tabular; the benchmark currently under‑represents unstructured data (text, images) and large‑scale production pipelines.
- Execution environment constraints – Sandbox runtimes limit the use of heavy libraries or GPU‑accelerated training, potentially biasing results toward lightweight models.
- Prompt diversity – Instructions are relatively uniform; future versions could explore more ambiguous or conversational prompts to test robustness.
- Long‑term reasoning – While DARE‑bench captures multi‑step pipelines, it does not evaluate iterative model debugging or interactive data‑exploration loops, an area ripe for extension.
Bottom line: DARE‑bench offers a concrete, reproducible yardstick for measuring how well LLMs can act as true data‑science partners, and its publicly available training set opens the door for rapid, targeted model improvement in this high‑impact domain.
Authors
- Fan Shu
- Yite Wang
- Ruofan Wu
- Boyi Liu
- Zhewei Yao
- Yuxiong He
- Feng Yan
Paper Information
- arXiv ID: 2602.24288v1
- Categories: cs.AI, cs.CL
- Published: February 27, 2026