[Paper] Towards a Science of AI Agent Reliability
Source: arXiv - 2602.16666v1
Overview
The paper Towards a Science of AI Agent Reliability tackles a glaring gap in how we evaluate modern AI agents. While benchmark scores keep climbing, real‑world deployments still see frequent, sometimes catastrophic failures. The authors argue that a single “success rate” metric hides crucial reliability issues and propose a systematic, engineering‑inspired framework to measure how agents behave—not just whether they succeed.
Key Contributions
- Reliability Taxonomy: Defines four core dimensions—Consistency, Robustness, Predictability, and Safety—that together capture an agent’s operational health.
- Twelve Concrete Metrics: Provides specific, computable measures for each dimension (e.g., variance across runs, sensitivity to input perturbations, failure‑mode entropy, bounded error severity).
- Benchmark‑Level Evaluation Suite: Implements the metrics on 14 state‑of‑the‑art agentic models across two widely used benchmarks, offering the first large‑scale reliability comparison.
- Empirical Insight: Shows that recent gains in raw capability translate into only modest reliability improvements, highlighting persistent weaknesses.
- Open‑Source Toolkit: Releases code and evaluation scripts so practitioners can readily apply the reliability profile to their own agents.
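One of the metrics named above, failure‑mode entropy, can be sketched as Shannon entropy over observed failure labels. The label names and the exact formulation below are illustrative assumptions, not the paper's specification:

```python
import math
from collections import Counter

def failure_mode_entropy(failure_labels):
    """Shannon entropy (in bits) of the empirical failure-mode distribution.

    Low entropy means failures concentrate in a few modes (easier to
    anticipate); high entropy means they are dispersed. The label set
    here is hypothetical.
    """
    counts = Counter(failure_labels)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# All failures in one mode -> 0 bits; uniform over 4 modes -> 2 bits.
print(failure_mode_entropy(["timeout"] * 8))                                  # 0.0
print(failure_mode_entropy(["timeout", "loop", "refusal", "hallucination"] * 2))  # 2.0
```

A concentrated failure distribution scores near zero, which is exactly the property the paper's predictability axis rewards.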
Methodology
- Define Reliability Axes – The authors draw from safety‑critical engineering (e.g., aerospace, medical devices) to formalize four axes:
  - Consistency: Does the agent produce the same output given identical inputs across multiple runs?
  - Robustness: How does performance degrade under controlled perturbations (noise, adversarial edits, distribution shift)?
  - Predictability: Can we anticipate when and how the agent will fail (e.g., failure‑mode clustering, confidence calibration)?
  - Safety: Are errors bounded in severity, and do they avoid catastrophic outcomes?
- Metric Construction – For each axis, they design one or more quantitative metrics. For example, consistency is measured by pairwise output similarity across seeds; robustness uses performance curves over increasing perturbation magnitude.
- Experimental Setup – They select 14 agents (including large language models and reinforcement‑learning policies) and evaluate them on two complementary benchmarks: a text‑based instruction‑following suite and a simulated navigation task. Each agent is run multiple times per task, with systematic perturbations applied.
- Analysis Pipeline – The metrics are aggregated into a reliability profile per model, visualized as radar charts and heatmaps to expose trade‑offs.
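The consistency metric described above (pairwise output similarity across seeds) can be sketched as mean pairwise token‑set Jaccard similarity over repeated runs. Jaccard is an assumed stand‑in here; the paper's actual similarity function may differ:

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two agent outputs."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def consistency_score(outputs):
    """Mean pairwise similarity across repeated runs of the same input.

    1.0 means every seed produced the same token set.
    """
    pairs = list(combinations(outputs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

runs = ["move north then pick up key",
        "move north then pick up key",
        "pick up key after moving north"]
print(round(consistency_score(runs), 3))  # 0.667
```

Paraphrased-but-equivalent outputs score below 1.0 here, which is a known limitation of surface-level similarity; semantic similarity measures would be a natural refinement.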
Results & Findings
- Small Reliability Gains: The newest models (e.g., GPT‑4‑style) improve raw success rates by ~10‑15% over older baselines, but their reliability scores (especially robustness and safety) improve by less than 5%.
- Consistency vs. Capability Trade‑off: Some high‑performing agents exhibit higher output variance, suggesting that scaling up model size can hurt repeatability.
- Robustness Gaps: Across all agents, performance drops sharply with modest input noise (e.g., 5% token perturbation leads to >20% success loss).
- Predictability Deficits: Failure modes are highly dispersed; confidence scores are poorly calibrated, making it hard to predict when an agent will err.
- Safety Concerns: Certain agents produce unbounded erroneous outputs (e.g., hallucinated instructions) that could be dangerous in downstream pipelines.
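The perturbation sensitivity reported above can be illustrated with a toy robustness curve: success rate measured at increasing token‑perturbation rates. The noise model (random token replacement) and the toy agent are assumptions for illustration, not the paper's actual setup:

```python
import random

def perturb_tokens(text, rate, rng, vocab=("foo", "bar", "baz")):
    """Replace each token with a random vocab word with probability `rate`.

    A simple stand-in for the paper's controlled input perturbations.
    """
    return " ".join(rng.choice(vocab) if rng.random() < rate else tok
                    for tok in text.split())

def robustness_curve(agent, tasks, rates, seed=0):
    """Success rate at each perturbation magnitude (higher rate = more noise)."""
    rng = random.Random(seed)
    curve = {}
    for rate in rates:
        wins = sum(agent(perturb_tokens(prompt, rate, rng)) == answer
                   for prompt, answer in tasks)
        curve[rate] = wins / len(tasks)
    return curve

# Toy "agent": succeeds only if the key token survives the perturbation.
tasks = [(f"return token{i}", f"token{i}") for i in range(50)]
agent = lambda prompt: prompt.split()[-1]
print(robustness_curve(agent, tasks, [0.0, 0.05, 0.2]))
```

Plotting `curve` over `rates` gives the performance‑degradation curve the methodology section describes; a sharp early drop is the robustness gap the paper flags.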
Practical Implications
- Developer Tooling: The released metric suite can be integrated into CI pipelines to flag reliability regressions before deployment.
- Model Selection: Teams can now weigh raw accuracy against reliability dimensions, choosing models that meet safety thresholds for high‑stakes applications (e.g., medical triage, autonomous navigation).
- Fine‑Tuning Strategies: The findings suggest that targeted robustness fine‑tuning (e.g., adversarial data augmentation) may be more effective than chasing higher benchmark scores alone.
- Risk Management: By quantifying error severity, product owners can design fallback mechanisms (human‑in‑the‑loop, circuit breakers) that trigger when safety metrics cross predefined limits.
- Regulatory Readiness: A standardized reliability profile aligns with emerging AI governance frameworks that require demonstrable safety and robustness evidence.
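The CI‑gating and threshold ideas above can be sketched as a simple reliability gate that fails a build when any profile metric falls below its floor. The metric names and threshold values are hypothetical, chosen only to show the shape of such a check:

```python
# Minimum acceptable value per metric; values here are illustrative,
# not thresholds from the paper.
THRESHOLDS = {
    "consistency": 0.90,
    "robustness": 0.75,
    "calibration": 0.80,
    "safety": 0.95,
}

def reliability_gate(profile, thresholds=THRESHOLDS):
    """Return the metrics in `profile` that violate their threshold.

    A missing metric counts as a violation (treated as 0.0).
    """
    return [name for name, floor in thresholds.items()
            if profile.get(name, 0.0) < floor]

profile = {"consistency": 0.93, "robustness": 0.71,
           "calibration": 0.84, "safety": 0.97}
violations = reliability_gate(profile)
if violations:
    print(f"reliability regression: {violations}")  # ['robustness']
```

In a CI pipeline, a non-empty `violations` list would fail the job, turning the reliability profile into an automated release gate rather than a one-off report.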
Limitations & Future Work
- Benchmark Coverage: The study focuses on two benchmarks; broader domain coverage (e.g., vision, multimodal agents) is needed to generalize the reliability taxonomy.
- Metric Sensitivity: Some metrics (e.g., perturbation thresholds) are heuristic and may need calibration for specific deployment contexts.
- Scalability: Computing all twelve metrics for very large models can be resource‑intensive; future work could explore surrogate or sampling‑based estimators.
- Human Factors: The paper does not address how end‑users interpret reliability scores or how these metrics interact with user trust.
- Dynamic Environments: Extending the framework to continual‑learning or online‑adaptation scenarios remains an open challenge.
Bottom line: This work provides the first systematic, engineering‑grade toolbox for measuring AI agent reliability. For developers building mission‑critical systems, it offers a concrete way to move beyond “does it work on average?” toward “does it work safely and predictably in the real world?”
Authors
- Stephan Rabanser
- Sayash Kapoor
- Peter Kirgis
- Kangheng Liu
- Saiteja Utpala
- Arvind Narayanan
Paper Information
- arXiv ID: 2602.16666v1
- Categories: cs.AI, cs.CY, cs.LG
- Published: February 18, 2026