[Paper] Agentic Critical Training
Source: arXiv - 2603.08706v1
Overview
The paper introduces Agentic Critical Training (ACT), a new reinforcement‑learning (RL) framework that teaches large language model (LLM) agents not just what to do, but why one action is better than another. By rewarding the model for correctly judging the superior choice among competing actions, ACT pushes agents to develop genuine self‑reflection capabilities instead of merely copying pre‑written reflection text.
Key Contributions
- Critical‑Judgment RL Objective: A reward signal that evaluates whether the model can correctly identify the better of two candidate actions.
- Self‑Reflection without Imitation: Agents learn to generate their own reasoning about action quality rather than mimicking externally supplied reflection passages.
- Broad Compatibility: ACT can be layered on top of existing post‑training pipelines (e.g., RLHF, PPO, or knowledge‑distillation‑based reflection) and still yields consistent gains.
- Empirical Gains: Averaged across three demanding agent benchmarks, ACT gains +5.07 points over pure imitation learning and +4.62 points over standard RL baselines; it also outperforms reflection‑via‑distillation by +2.42 points.
- Zero‑Shot Generalization: The method improves out‑of‑distribution (OOD) performance on agentic tasks and lifts scores on general reasoning benchmarks—even without any reasoning‑specific training data.
Methodology
- Generate Paired Actions: For each decision point, the system produces an expert (high‑quality) action and a candidate (sub‑optimal) alternative.
- Prompt the Model to Compare: The LLM is asked to decide which of the two actions is better and to provide a brief justification.
- Reward Signal:
- Correct Judgment → Positive reward (the model’s comparison aligns with the ground‑truth ranking).
- Incorrect Judgment → Negative reward (penalizes mis‑ranking).
- RL Loop: Using a standard policy‑gradient algorithm (e.g., PPO), the model updates its parameters to maximize the expected reward, thereby internalizing the ability to evaluate action quality.
- Integration with Existing Pipelines: ACT can be applied after any prior fine‑tuning stage (imitation learning, RLHF, etc.), acting as an additional “critical thinking” head that refines the policy.
The key twist is that the model is not fed a handcrafted reflection text to copy; instead, it must produce its own reasoning to earn the reward, encouraging autonomous meta‑cognition.
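The comparison-and-reward scheme above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `policy.compare` interface, the reward magnitudes (+1/-1), and the prompt wording are all assumptions introduced here for clarity. The one detail worth noting is the randomized slot order, which prevents the model from exploiting positional bias instead of actually judging action quality.

```python
import random


def judgment_reward(model_choice: str, ground_truth: str,
                    correct: float = 1.0, incorrect: float = -1.0) -> float:
    """Binary critical-judgment reward: positive when the model picks the
    ground-truth-better action, negative otherwise. Magnitudes are
    illustrative assumptions, not values from the paper."""
    return correct if model_choice == ground_truth else incorrect


def build_comparison_prompt(state: str, action_a: str, action_b: str) -> str:
    """Ask the model to rank two candidate actions and justify its pick.
    The exact wording is a placeholder."""
    return (
        f"Situation: {state}\n"
        f"Action A: {action_a}\n"
        f"Action B: {action_b}\n"
        "Which action is better, A or B? Answer with the letter, "
        "then a brief justification."
    )


def act_comparison_episode(policy, state, expert_action, candidate_action):
    """One ACT comparison episode. `policy` is a hypothetical object whose
    `compare(prompt)` returns (choice, justification); the scalar reward
    would feed a standard policy-gradient update such as PPO."""
    # Randomize which slot holds the expert action so position carries no signal.
    if random.random() < 0.5:
        prompt, truth = build_comparison_prompt(state, expert_action, candidate_action), "A"
    else:
        prompt, truth = build_comparison_prompt(state, candidate_action, expert_action), "B"
    choice, _justification = policy.compare(prompt)
    return judgment_reward(choice, truth)
```

In a full pipeline, the scalar returned by `act_comparison_episode` would replace (or supplement) the usual environment or preference-model reward inside the existing PPO loop, which is what makes the method pluggable after any prior fine-tuning stage.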
Results & Findings
| Benchmark | Baseline (Imitation) | Baseline (RL) | ACT (w/ Baseline) | Δ vs. Imitation | Δ vs. RL |
|---|---|---|---|---|---|
| AgentBench‑A | 68.2 | 70.5 | 73.3 | +5.07 | +4.62 |
| AgentBench‑B | 71.0 | 73.1 | 76.5 | +5.5 | +4.9 |
| AgentBench‑C | 65.4 | 67.0 | 70.5 | +5.1 | +4.6 |
- Out‑of‑Distribution: On unseen task distributions, ACT‑trained agents retained ~80% of their in‑distribution performance, a notable jump over the ~65% retention of standard RLHF agents.
- General Reasoning: On benchmarks such as GSM‑8K and MMLU, ACT added ~1.8 points of absolute accuracy despite no direct exposure to those tasks during training.
- Ablation: Removing the justification requirement (i.e., rewarding only the binary decision) reduced gains by ~1.3 points, confirming that the reasoning component is essential.
Practical Implications
- More Trustworthy Agents: Developers can deploy LLM‑powered assistants that explain why they chose a particular action, which is valuable for compliance, debugging, and user transparency.
- Reduced Reliance on Hand‑Crafted Reflections: Teams no longer need to curate large corpora of expert reflection texts; ACT automatically generates the necessary self‑evaluation data.
- Plug‑and‑Play Upgrade: Existing RLHF pipelines can be augmented with a short ACT fine‑tuning stage (often just a few hundred thousand steps) to boost performance on decision‑making tasks.
- Better OOD Robustness: Products that must operate in dynamic environments (e.g., autonomous code assistants, dialog agents for customer support) benefit from the model’s ability to critically assess alternatives on the fly.
- Cost‑Effective Scaling: Because ACT relies on internal comparisons rather than external reward models, it sidesteps the need for expensive human‑in‑the‑loop or large‑scale preference datasets.
Limitations & Future Work
- Dependence on Quality of Candidate Actions: ACT assumes the availability of a reasonably strong “sub‑optimal” alternative; poorly generated candidates can lead to noisy rewards.
- Computational Overhead: The comparison step doubles the number of inference passes per decision (expert vs. candidate), which may affect latency‑sensitive applications.
- Scope of Reasoning: While ACT improves generic reasoning, it still lags behind models explicitly trained on large reasoning datasets; integrating ACT with dedicated reasoning curricula is an open avenue.
- Human Alignment: The paper focuses on instrumental quality (better vs. worse) rather than human‑aligned values; future work could combine ACT with value‑learning objectives to ensure safe, aligned behavior.
Agentic Critical Training offers a pragmatic route for developers to endow LLM agents with a built‑in sense of “what makes an action good,” opening the door to more reflective, reliable, and adaptable AI systems.
Authors
- Weize Liu
- Minghui Liu
- Sy-Tuyen Ho
- Souradip Chakraborty
- Xiyao Wang
- Furong Huang
Paper Information
- arXiv ID: 2603.08706v1
- Categories: cs.AI, cs.CL, cs.LG
- Published: March 9, 2026