[Paper] TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models
Source: arXiv - 2602.15449v1
Overview
The paper introduces TAROT, a reinforcement fine-tuning framework that teaches large language models (LLMs) to write more reliable, algorithmically sophisticated code. By pairing test-driven rewards with a capability-aware curriculum, TAROT orders training problems from easy to hard (or the reverse) to match a model's current skill level, substantially improving functional correctness.
Key Contributions
- Four‑tier test suite (basic, intermediate, complex, edge) for every coding problem, giving a fine‑grained difficulty map.
- Capability‑adaptive curriculum: curriculum progression is chosen based on the model’s intrinsic ability rather than raw reward magnitude.
- Decoupled reward and curriculum: separates the signal used for optimization from the difficulty ordering, leading to more stable gradient updates.
- Empirical discovery of a “model‑dependent curriculum law”: weaker models benefit from easy‑to‑hard ordering, while stronger models excel with a hard‑first schedule.
- Open‑source release of code, data, and reproducible training pipelines (GitHub: deep-diver/TAROT).
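The four-tier suite can be pictured as a per-problem map from tier name to unit tests, scored by pass rate. The sketch below is illustrative only: the data layout, class, and function names are assumptions, not the authors' code; only the tier names (basic, intermediate, complex, edge) come from the paper.

```python
from dataclasses import dataclass, field

TIERS = ("basic", "intermediate", "complex", "edge")

@dataclass
class TieredSuite:
    # Each tier maps to a list of (input args, expected output) unit tests.
    tests: dict = field(default_factory=lambda: {t: [] for t in TIERS})

    def pass_rate(self, candidate, tier):
        """Fraction of tests in `tier` that the candidate function passes."""
        cases = self.tests[tier]
        if not cases:
            return 0.0
        passed = sum(1 for args, expected in cases if candidate(*args) == expected)
        return passed / len(cases)

# Example: a suite for an absolute-value problem.
suite = TieredSuite()
suite.tests["basic"] = [((3,), 3), ((0,), 0)]
suite.tests["edge"] = [((-(2**31),), 2**31)]

print(suite.pass_rate(abs, "basic"))  # 1.0
print(suite.pass_rate(abs, "edge"))   # 1.0
```

Keeping the tiers separate, rather than pooling all tests, is what gives the curriculum a fine-grained difficulty map to schedule over.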
Methodology
- Problem‑level test generation – For each coding prompt, the authors automatically generate four groups of unit tests that target increasing levels of logical depth and edge‑case coverage.
- Capability estimation – Before fine‑tuning, a lightweight probe evaluates the base LLM’s code‑generation skill (e.g., pass rate on the basic tier).
- Curriculum policy pool – Several curriculum strategies are pre‑defined (e.g., Easy→Hard, Hard→Easy, Random).
- Policy selection – Using the capability estimate, TAROT selects the curriculum that maximizes expected reward gain for that model.
- Reinforcement fine‑tuning – The model is trained with a standard RL‑style objective (e.g., PPO) where the reward is the proportion of passed tests in the current tier. Because the tier order is fixed by the chosen policy, the reward distribution stays balanced throughout training.
The whole pipeline is modular, so developers can plug in their own LLMs, test generators, or curriculum heuristics.
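The capability probe, policy selection, and tier-level reward can be sketched as follows. This is a minimal illustration of the mechanism described above, assuming the "weak models go easy-to-hard, strong models go hard-first" rule from the paper; the 0.5 threshold and all function names are assumptions, not the authors' implementation.

```python
# Tier names from the paper; ordering logic follows the model-dependent
# curriculum law, with an assumed capability threshold of 0.5.
TIERS = ["basic", "intermediate", "complex", "edge"]

def probe_capability(basic_pass_rate: float) -> str:
    """Lightweight capability estimate: bucket the model by basic-tier pass rate."""
    return "strong" if basic_pass_rate >= 0.5 else "weak"

def select_curriculum(capability: str) -> list:
    """Weaker models get easy-to-hard; stronger models get hard-first."""
    return list(TIERS) if capability == "weak" else list(reversed(TIERS))

def tier_reward(n_passed: int, n_total: int) -> float:
    """RL reward: proportion of unit tests passed in the current tier."""
    return n_passed / n_total if n_total else 0.0

# A weak model (30% basic-tier pass rate) trains easy-to-hard:
schedule = select_curriculum(probe_capability(0.3))
print(schedule)            # ['basic', 'intermediate', 'complex', 'edge']
print(tier_reward(7, 10))  # 0.7
```

Because the reward is a within-tier proportion, its scale stays comparable as training moves between tiers, which is the decoupling the paper credits for more stable gradient updates.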
Results & Findings
- Functional correctness improved by up to 18% on the HumanEval benchmark for a 2.7B-parameter model when the appropriate curriculum was used.
- Robustness to edge cases improved by 22%, indicating better handling of rare or pathological inputs.
- Ablation studies showed that decoupling curriculum from raw rewards reduced variance in training loss by ~30 %, leading to faster convergence.
- The “easy‑to‑hard” curriculum gave the biggest boost for models under 6 B parameters, while “hard‑first” yielded the highest gains for 13 B‑plus models.
Practical Implications
- Better code assistants: Integrating TAROT into the fine‑tuning stage of code‑generation products (e.g., Copilot‑style tools) can raise the pass‑rate of generated snippets, reducing the need for manual debugging.
- Cost‑effective model scaling: Smaller LLMs can achieve performance comparable to larger, more expensive models by applying the right curriculum, saving compute and inference costs.
- Automated test generation pipelines: The four‑tier test suite can be reused as a standard evaluation harness for any code‑generation model, simplifying CI/CD for AI‑driven development tools.
- Adaptive deployment: Services can dynamically select a curriculum based on a model’s observed success rate, enabling on‑the‑fly fine‑tuning for new domains or languages.
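As a concrete example of reusing the tiered suite as an evaluation harness, a CI check could gate on per-tier pass rates. This is a hypothetical sketch: the tier names come from the paper, but the gating function, report shape, and thresholds are assumptions.

```python
def ci_gate(pass_rates: dict, thresholds: dict) -> bool:
    """Pass the check only if every thresholded tier meets its bar."""
    return all(pass_rates.get(tier, 0.0) >= bar for tier, bar in thresholds.items())

# Example per-tier pass-rate report for a candidate model (made-up numbers):
report = {"basic": 0.98, "intermediate": 0.91, "complex": 0.74, "edge": 0.63}

print(ci_gate(report, {"basic": 0.95, "edge": 0.60}))  # True
print(ci_gate(report, {"edge": 0.80}))                 # False
```

Gating only on selected tiers (here basic and edge) lets teams tighten the edge-case bar independently of overall correctness.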
Limitations & Future Work
- The current test generator focuses on Python and may need adaptation for other languages or domain‑specific APIs.
- Capability estimation relies on a simple probe; richer diagnostics (e.g., reasoning trace analysis) could further refine curriculum selection.
- The study evaluates primarily on synthetic benchmarks; real‑world IDE integration and user studies are left for future research.
TL;DR
TAROT shows that “one size does not fit all” for code‑generation fine‑tuning. By matching curriculum difficulty to a model’s skill, developers can extract substantially more reliable code from LLMs without blowing up compute budgets. The open‑source toolkit makes it easy to try this approach on your own models today.
Authors
- Chansung Park
- Juyong Jiang
- Fan Wang
- Sayak Paul
- Jiasi Shen
- Jing Tang
- Jianguo Li
Paper Information
- arXiv ID: 2602.15449v1
- Categories: cs.CL, cs.LG, cs.SE
- Published: February 17, 2026