[Paper] Towards a Science of AI Agent Reliability
Source: arXiv - 2602.16666v1
Overview
The paper Towards a Science of AI Agent Reliability tackles a glaring gap in how we evaluate modern AI agents. While benchmark scores keep climbing, real‑world deployments still see frequent, sometimes catastrophic failures. The authors argue that a single “success rate” metric hides crucial reliability issues and propose a systematic, engineering‑inspired framework to measure how agents behave—not just whether they succeed.
Key Contributions
- Reliability Taxonomy: Defines four core dimensions—Consistency, Robustness, Predictability, and Safety—that together capture an agent’s operational health.
- Twelve Concrete Metrics: Provides specific, computable measures for each dimension (e.g., variance across runs, sensitivity to input perturbations, failure‑mode entropy, bounded error severity).
- Benchmark‑Level Evaluation Suite: Implements the metrics on 14 state‑of‑the‑art agentic models across two widely used benchmarks, offering the first large‑scale reliability comparison.
- Empirical Insight: Shows that recent gains in raw capability translate into only modest reliability improvements, highlighting persistent weaknesses.
- Open‑Source Toolkit: Releases code and evaluation scripts so practitioners can readily apply the reliability profile to their own agents.
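One of the metrics named above, failure‑mode entropy, can be sketched as Shannon entropy over observed failure labels. The label names and the exact formulation below are illustrative assumptions, not the paper's specification:

```python
import math
from collections import Counter

def failure_mode_entropy(failure_labels):
    """Shannon entropy (in bits) of the empirical failure-mode distribution.

    Low entropy means failures concentrate in a few modes (easier to
    anticipate); high entropy means they are dispersed. The label set
    here is hypothetical.
    """
    counts = Counter(failure_labels)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# All failures in one mode -> 0 bits; uniform over 4 modes -> 2 bits.
print(failure_mode_entropy(["timeout"] * 8))                                  # 0.0
print(failure_mode_entropy(["timeout", "loop", "refusal", "hallucination"] * 2))  # 2.0
```

A concentrated failure distribution scores near zero, which is exactly the property the paper's predictability axis rewards.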
Methodology
- Define Reliability Axes – The authors draw from safety‑critical engineering (e.g., aerospace, medical devices) to formalize four axes:
  - Consistency: Does the agent produce the same output given identical inputs across multiple runs?
  - Robustness: How does performance degrade under controlled perturbations (noise, adversarial edits, distribution shift)?
  - Predictability: Can we anticipate when and how the agent will fail (e.g., failure‑mode clustering, confidence calibration)?
  - Safety: Are errors bounded in severity, and do they avoid catastrophic outcomes?
- Metric Construction – For each axis, they design one or more quantitative metrics. For example, consistency is measured by pairwise output similarity across seeds; robustness uses performance curves over increasing perturbation magnitude.
- Experimental Setup – They select 14 agents (including large language models and reinforcement‑learning policies) and evaluate them on two complementary benchmarks: a text‑based instruction‑following suite and a simulated navigation task. Each agent is run multiple times per task, with systematic perturbations applied.
- Analysis Pipeline – The metrics are aggregated into a reliability profile per model, visualized as radar charts and heatmaps to expose trade‑offs.
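The consistency metric described above (pairwise output similarity across seeds) can be sketched as mean pairwise token‑set Jaccard similarity over repeated runs. Jaccard is an assumed stand‑in here; the paper's actual similarity function may differ:

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two agent outputs."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def consistency_score(outputs):
    """Mean pairwise similarity across repeated runs of the same input.

    1.0 means every seed produced the same token set.
    """
    pairs = list(combinations(outputs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

runs = ["move north then pick up key",
        "move north then pick up key",
        "pick up key after moving north"]
print(round(consistency_score(runs), 3))  # 0.667
```

Paraphrased-but-equivalent outputs score below 1.0 here, which is a known limitation of surface-level similarity; semantic similarity measures would be a natural refinement.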
Results & Findings
- Small Reliability Gains: The newest models (e.g., GPT‑4‑style) improve raw success rates by ~10‑15% over older baselines, but their reliability scores (especially robustness and safety) improve by less than 5%.
- Consistency vs. Capability Trade‑off: Some high‑performing agents exhibit higher output variance, suggesting that scaling up model size can hurt repeatability.
- Robustness Gaps: Across all agents, performance drops sharply with modest input noise (e.g., 5% token perturbation leads to >20% success loss).
- Predictability Deficits: Failure modes are highly dispersed; confidence scores are poorly calibrated, making it hard to predict when an agent will err.
- Safety Concerns: Certain agents produce unbounded erroneous outputs (e.g., hallucinated instructions) that could be dangerous in downstream pipelines.
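The perturbation sensitivity reported above can be illustrated with a toy robustness curve: success rate measured at increasing token‑perturbation rates. The noise model (random token replacement) and the toy agent are assumptions for illustration, not the paper's actual setup:

```python
import random

def perturb_tokens(text, rate, rng, vocab=("foo", "bar", "baz")):
    """Replace each token with a random vocab word with probability `rate`.

    A simple stand-in for the paper's controlled input perturbations.
    """
    return " ".join(rng.choice(vocab) if rng.random() < rate else tok
                    for tok in text.split())

def robustness_curve(agent, tasks, rates, seed=0):
    """Success rate at each perturbation magnitude (higher rate = more noise)."""
    rng = random.Random(seed)
    curve = {}
    for rate in rates:
        wins = sum(agent(perturb_tokens(prompt, rate, rng)) == answer
                   for prompt, answer in tasks)
        curve[rate] = wins / len(tasks)
    return curve

# Toy "agent": succeeds only if the key token survives the perturbation.
tasks = [(f"return token{i}", f"token{i}") for i in range(50)]
agent = lambda prompt: prompt.split()[-1]
print(robustness_curve(agent, tasks, [0.0, 0.05, 0.2]))
```

Plotting `curve` over `rates` gives the performance‑degradation curve the methodology section describes; a sharp early drop is the robustness gap the paper flags.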
Practical Implications
- Developer Tooling: The released metric suite can be integrated into CI pipelines to flag reliability regressions before deployment.
- Model Selection: Teams can now weigh raw accuracy against reliability dimensions, choosing models that meet safety thresholds for high‑stakes applications (e.g., medical triage, autonomous navigation).
- Fine‑Tuning Strategies: The findings suggest that targeted robustness fine‑tuning (e.g., adversarial data augmentation) may be more effective than chasing higher benchmark scores alone.
- Risk Management: By quantifying error severity, product owners can design fallback mechanisms (human‑in‑the‑loop, circuit breakers) that trigger when safety metrics cross predefined limits.
- Regulatory Readiness: A standardized reliability profile aligns with emerging AI governance frameworks that require demonstrable safety and robustness evidence.
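The CI‑gating and threshold ideas above can be sketched as a simple reliability gate that fails a build when any profile metric falls below its floor. The metric names and threshold values are hypothetical, chosen only to show the shape of such a check:

```python
# Minimum acceptable value per metric; values here are illustrative,
# not thresholds from the paper.
THRESHOLDS = {
    "consistency": 0.90,
    "robustness": 0.75,
    "calibration": 0.80,
    "safety": 0.95,
}

def reliability_gate(profile, thresholds=THRESHOLDS):
    """Return the metrics in `profile` that violate their threshold.

    A missing metric counts as a violation (treated as 0.0).
    """
    return [name for name, floor in thresholds.items()
            if profile.get(name, 0.0) < floor]

profile = {"consistency": 0.93, "robustness": 0.71,
           "calibration": 0.84, "safety": 0.97}
violations = reliability_gate(profile)
if violations:
    print(f"reliability regression: {violations}")  # ['robustness']
```

In a CI pipeline, a non-empty `violations` list would fail the job, turning the reliability profile into an automated release gate rather than a one-off report.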
Limitations & Future Work
- Benchmark Coverage: The study focuses on two benchmarks; broader domain coverage (e.g., vision, multimodal agents) is needed to generalize the reliability taxonomy.
- Metric Sensitivity: Some metrics (e.g., perturbation thresholds) are heuristic and may need calibration for specific deployment contexts.
- Scalability: Computing all twelve metrics for very large models can be resource‑intensive; future work could explore surrogate or sampling‑based estimators.
- Human Factors: The paper does not address how end‑users interpret reliability scores or how these metrics interact with user trust.
- Dynamic Environments: Extending the framework to continual‑learning or online‑adaptation scenarios remains an open challenge.
Bottom line: This work provides the first systematic, engineering‑grade toolbox for measuring AI agent reliability. For developers building mission‑critical systems, it offers a concrete way to move beyond “does it work on average?” toward “does it work safely and predictably in the real world?”
Authors
- Stephan Rabanser
- Sayash Kapoor
- Peter Kirgis
- Kangheng Liu
- Saiteja Utpala
- Arvind Narayanan
Paper Information
- arXiv ID: 2602.16666v1
- Categories: cs.AI, cs.CY, cs.LG
- Published: February 18, 2026