[Paper] Agentic Critical Training
Source: arXiv - 2603.08706v1
Overview
The paper introduces Agentic Critical Training (ACT), a new reinforcement‑learning (RL) framework that teaches large language model (LLM) agents not just what to do, but why one action is better than another. By rewarding the model for correctly judging the superior choice among competing actions, ACT pushes agents to develop genuine self‑reflection capabilities instead of merely copying pre‑written reflection text.
Key Contributions
- Critical‑Judgment RL Objective: A reward signal that evaluates whether the model can correctly identify the better of two candidate actions.
- Self‑Reflection without Imitation: Agents learn to generate their own reasoning about action quality rather than mimicking externally supplied reflection passages.
- Broad Compatibility: ACT can be layered on top of existing post‑training pipelines (e.g., RLHF, PPO, or knowledge‑distillation‑based reflection) and still yields consistent gains.
- Empirical Gains: Averaged across three demanding agent benchmarks, ACT gains +5.07 points over pure imitation learning and +4.62 points over standard RL baselines; it also outperforms reflection‑via‑distillation by +2.42 points.
- Zero‑Shot Generalization: The method improves out‑of‑distribution (OOD) performance on agentic tasks and lifts scores on general reasoning benchmarks—even without any reasoning‑specific training data.
Methodology
- Generate Paired Actions: For each decision point, the system produces an expert (high‑quality) action and a candidate (sub‑optimal) alternative.
- Prompt the Model to Compare: The LLM is asked to decide which of the two actions is better and to provide a brief justification.
- Reward Signal:
- Correct Judgment → Positive reward (the model’s comparison aligns with the ground‑truth ranking).
- Incorrect Judgment → Negative reward (penalizes mis‑ranking).
- RL Loop: Using a standard policy‑gradient algorithm (e.g., PPO), the model updates its parameters to maximize the expected reward, thereby internalizing the ability to evaluate action quality.
- Integration with Existing Pipelines: ACT can be applied after any prior fine‑tuning stage (imitation learning, RLHF, etc.), acting as an additional “critical thinking” head that refines the policy.
The key twist is that the model is not fed a handcrafted reflection text to copy; instead, it must produce its own reasoning to earn the reward, encouraging autonomous meta‑cognition.
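The comparison-and-reward scheme above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `policy.compare` interface, the reward magnitudes (+1/-1), and the prompt wording are all assumptions introduced here for clarity. The one detail worth noting is the randomized slot order, which prevents the model from exploiting positional bias instead of actually judging action quality.

```python
import random


def judgment_reward(model_choice: str, ground_truth: str,
                    correct: float = 1.0, incorrect: float = -1.0) -> float:
    """Binary critical-judgment reward: positive when the model picks the
    ground-truth-better action, negative otherwise. Magnitudes are
    illustrative assumptions, not values from the paper."""
    return correct if model_choice == ground_truth else incorrect


def build_comparison_prompt(state: str, action_a: str, action_b: str) -> str:
    """Ask the model to rank two candidate actions and justify its pick.
    The exact wording is a placeholder."""
    return (
        f"Situation: {state}\n"
        f"Action A: {action_a}\n"
        f"Action B: {action_b}\n"
        "Which action is better, A or B? Answer with the letter, "
        "then a brief justification."
    )


def act_comparison_episode(policy, state, expert_action, candidate_action):
    """One ACT comparison episode. `policy` is a hypothetical object whose
    `compare(prompt)` returns (choice, justification); the scalar reward
    would feed a standard policy-gradient update such as PPO."""
    # Randomize which slot holds the expert action so position carries no signal.
    if random.random() < 0.5:
        prompt, truth = build_comparison_prompt(state, expert_action, candidate_action), "A"
    else:
        prompt, truth = build_comparison_prompt(state, candidate_action, expert_action), "B"
    choice, _justification = policy.compare(prompt)
    return judgment_reward(choice, truth)
```

In a full pipeline, the scalar returned by `act_comparison_episode` would replace (or supplement) the usual environment or preference-model reward inside the existing PPO loop, which is what makes the method pluggable after any prior fine-tuning stage.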
Results & Findings
| Benchmark | Baseline (Imitation) | Baseline (RL) | ACT (w/ Baseline) | Δ vs. Imitation | Δ vs. RL |
|---|---|---|---|---|---|
| AgentBench‑A | 68.2 | 70.5 | 73.3 | +5.07 | +4.62 |
| AgentBench‑B | 71.0 | 73.1 | 76.5 | +5.5 | +4.9 |
| AgentBench‑C | 65.4 | 67.0 | 70.5 | +5.1 | +4.6 |
- Out‑of‑Distribution: On unseen task distributions, ACT‑trained agents retained ~80% of their in‑distribution performance, a notable jump over the ~65% retention of standard RLHF agents.
- General Reasoning: On benchmarks such as GSM‑8K and MMLU, ACT added ~1.8 points of absolute accuracy despite no direct exposure to those tasks during training.
- Ablation: Removing the justification requirement (i.e., rewarding only the binary decision) reduced gains by ~1.3 points, confirming that the reasoning component is essential.
Practical Implications
- More Trustworthy Agents: Developers can deploy LLM‑powered assistants that explain why they chose a particular action, which is valuable for compliance, debugging, and user transparency.
- Reduced Reliance on Hand‑Crafted Reflections: Teams no longer need to curate large corpora of expert reflection texts; ACT automatically generates the necessary self‑evaluation data.
- Plug‑and‑Play Upgrade: Existing RLHF pipelines can be augmented with a short ACT fine‑tuning stage (often just a few hundred thousand steps) to boost performance on decision‑making tasks.
- Better OOD Robustness: Products that must operate in dynamic environments (e.g., autonomous code assistants, dialog agents for customer support) benefit from the model’s ability to critically assess alternatives on the fly.
- Cost‑Effective Scaling: Because ACT relies on internal comparisons rather than external reward models, it sidesteps the need for expensive human‑in‑the‑loop or large‑scale preference datasets.
Limitations & Future Work
- Dependence on Quality of Candidate Actions: ACT assumes the availability of a reasonably strong “sub‑optimal” alternative; poorly generated candidates can lead to noisy rewards.
- Computational Overhead: The comparison step doubles the number of inference passes per decision (expert vs. candidate), which may affect latency‑sensitive applications.
- Scope of Reasoning: While ACT improves generic reasoning, it still lags behind models explicitly trained on large reasoning datasets; integrating ACT with dedicated reasoning curricula is an open avenue.
- Human Alignment: The paper focuses on instrumental quality (better vs. worse) rather than human‑aligned values; future work could combine ACT with value‑learning objectives to ensure safe, aligned behavior.
Agentic Critical Training offers a pragmatic route for developers to endow LLM agents with a built‑in sense of “what makes an action good,” opening the door to more reflective, reliable, and adaptable AI systems.
Authors
- Weize Liu
- Minghui Liu
- Sy-Tuyen Ho
- Souradip Chakraborty
- Xiyao Wang
- Furong Huang
Paper Information
- arXiv ID: 2603.08706v1
- Categories: cs.AI, cs.CL, cs.LG
- Published: March 9, 2026