[Paper] Measuring the Unmeasurable: Markov Chain Reliability for LLM Agents
Source: arXiv - 2604.24579v1
Overview
The paper introduces TraceToChain, a reproducible pipeline that turns execution traces of large‑language‑model (LLM) agents into an absorbing discrete‑time Markov chain (DTMC). By doing so, it unifies disparate reliability metrics (e.g., pass@k, reliability decay curves) into a single, statistically‑grounded “success‑time” distribution, while also providing diagnostics and uncertainty estimates that are missing from current benchmark reporting.
Key Contributions
- Trace‑to‑DTMC pipeline that automatically clusters trace states, estimates transition probabilities with Laplace‑smoothed MLE, and fits an absorbing DTMC to LLM agent behavior.
- Statistical diagnostics: composite Akaike Information Criterion (AIC) and Kolmogorov–Smirnov (KS) goodness‑of‑fit tests to verify that the chain faithfully represents the observed traces.
- Uncertainty quantification: Dirichlet‑posterior credible intervals and non‑parametric bootstrap intervals for every transition probability.
- Unified reliability view: Demonstrates that common metrics (pass@k, pass^k, reliability decay curve) are merely projections of a single first‑passage time distribution derived from the DTMC.
- Empirical validation: On seven controlled MAST‑style frameworks, the fitted DTMC reproduces held‑out reliability curves with a maximum L∞ error of 0.053 and passes KS tests (p > 0.05) on all frameworks.
Methodology
- Trace Collection – Run an LLM agent on a suite of tasks and record every intermediate state (e.g., tool calls, prompts, responses).
- Automatic Clustering – Group similar states into a taxonomy of “macro‑states” using a data‑driven clustering algorithm, shrinking the state space while preserving semantics.
- Transition Estimation – Count how often the agent moves from one macro‑state to another. Apply Laplace smoothing to avoid zero‑probability edges, then compute maximum‑likelihood estimates for the transition matrix Q (transient‑to‑transient) and the absorbing matrices R₊ (to success) and R₋ (to failure).
- Model Fit Checks –
  - AIC evaluates model parsimony vs. fit.
  - KS test compares the empirical first‑passage time CDF (when the trace first hits an absorbing state) with the analytic CDF derived from the DTMC.
- Uncertainty Reporting – Treat transition counts as draws from a Dirichlet distribution to obtain credible intervals; additionally, bootstrap the entire trace set to produce non‑parametric confidence bands.
- Reliability Extraction – Use classical reliability formulas (Kemeny–Snell, Goel–Okumoto, etc.) on the fitted DTMC to compute pass@k, pass^k, and the reliability decay curve as closed‑form functions of the first‑passage distribution.
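The uncertainty-reporting step maps naturally onto conjugate Bayesian updating: with a Dirichlet prior, each row's posterior over outgoing transition probabilities is again Dirichlet, so credible intervals come straight from posterior sampling. A minimal sketch, assuming made-up transition counts and a uniform (add-one) prior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical transition counts out of one macro-state to 3 successors
# (illustrative numbers, not from the paper).
counts = np.array([42.0, 7.0, 3.0])

# Dirichlet posterior under a uniform prior: alpha_j = counts_j + 1.
alpha = counts + 1.0
samples = rng.dirichlet(alpha, size=10_000)

# 95% equal-tailed credible interval for each transition probability.
lo, hi = np.percentile(samples, [2.5, 97.5], axis=0)
for j in range(len(counts)):
    print(f"P(-> state {j}): [{lo[j]:.3f}, {hi[j]:.3f}]")
```

The non-parametric alternative described in the paper resamples whole traces with replacement, re-runs the counting and estimation, and reads intervals off the empirical distribution of the re-estimated probabilities; resampling traces rather than individual transitions preserves within-trace dependence.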
Results & Findings
- Fit Quality: Across all seven test frameworks, the analytic reliability decay curves derived from the DTMC overlay the empirical curves with a median L∞ error of 0.048, indicating a tight match.
- Statistical Acceptance: Two‑sample KS tests on the first‑passage CDFs never reject the fitted model (p‑values ranging from 0.78 to 1.0).
- Uncertainty Tightness: Posterior and bootstrap intervals for each transition probability agree within ~0.01 at the median, showing that the pipeline yields stable estimates even with modest trace data.
- Metric Unification: The authors demonstrate mathematically that pass@k, pass^k, and the reliability decay curve are all marginalizations of the same underlying DTMC‑derived distribution, simplifying the interpretation of benchmark results.
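The unification claim can be illustrated with a toy absorbing DTMC. The matrices and the step budget below are illustrative assumptions, and the pass@k / pass^k formulas assume i.i.d. independent runs with eventual success probability p; one reading of the decay curve is the survival function of the success first-passage time:

```python
import numpy as np

# Illustrative two-transient-state chain (numbers not from the paper).
Q = np.array([[0.2, 0.5],
              [0.1, 0.3]])        # transient-to-transient
r_plus = np.array([0.2, 0.5])     # per-step absorption into success
pi0 = np.array([1.0, 0.0])        # start in state 0

# Eventual success probability via the fundamental matrix N = (I - Q)^-1.
N = np.linalg.inv(np.eye(2) - Q)
p = float(pi0 @ N @ r_plus)

# Scalar metrics as projections of the same chain (i.i.d.-runs assumption):
k = 10
pass_at_k = 1 - (1 - p) ** k      # at least one of k runs succeeds
pass_pow_k = p ** k               # all k runs succeed

# Reliability decay curve: probability of NOT having succeeded by step t.
def decay(t):
    mass, absorbed = pi0.copy(), 0.0
    for _ in range(t):
        absorbed += float(mass @ r_plus)
        mass = mass @ Q
    return 1.0 - absorbed
```

Each metric is computed from the same (Q, r₊, π₀) triple, which is the sense in which they are marginalizations of one first-passage distribution.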
Practical Implications
- More Trustworthy Benchmarks: Developers can now accompany scalar scores (e.g., pass@10 = 0.73) with a full success‑time distribution and confidence bounds, making it easier to compare agents under different latency or step‑budget constraints.
- Debugging & Optimization: The macro‑state transition matrix highlights “bottleneck” states (high probability of looping or failure), guiding targeted prompt engineering or tool‑integration improvements.
- Service‑Level Agreements (SLAs): Cloud providers offering LLM‑powered agents can use the DTMC model to predict the probability of task completion within a given time budget, enabling more precise SLA definitions.
- Automated Monitoring: By continuously feeding new traces into TraceToChain, production systems can detect drift (e.g., a sudden increase in transition to failure states) before it manifests as user‑visible errors.
- Cross‑Task Generalization: Because the pipeline is data‑driven, it can be applied to any sequential LLM workflow—code generation, autonomous web‑browsing, multi‑turn reasoning—without hand‑crafting task‑specific reliability formulas.
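As a concrete example of the monitoring use case, one simple drift check (a sketch under our own assumptions, not a mechanism described in the paper) compares each macro-state's freshly re-estimated outgoing distribution against a baseline via total variation distance and flags rows that moved beyond a threshold:

```python
import numpy as np

def drifted_rows(P_baseline, P_current, threshold=0.1):
    """Indices of states whose outgoing distribution shifted by more than
    `threshold` in total variation distance (threshold is illustrative)."""
    tv = 0.5 * np.abs(P_baseline - P_current).sum(axis=1)
    return np.flatnonzero(tv > threshold)

P_base = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
P_now  = np.array([[0.4, 0.2, 0.4],   # state 0's failure mass jumped
                   [0.1, 0.8, 0.1]])
print(drifted_rows(P_base, P_now))
```

In production the flagged rows would point directly at the macro-states whose behavior changed, which is more actionable than a drop in an aggregate pass@k score.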
Limitations & Future Work
- Controlled Benchmarks: Validation was performed on synthetic MAST‑style frameworks; real‑world, noisy environments may introduce state explosion or non‑Markovian dependencies that challenge the DTMC assumption.
- State Clustering Sensitivity: The quality of the macro‑state taxonomy depends on the clustering algorithm and hyper‑parameters; poor clustering could obscure important failure modes.
- Scalability: For extremely long traces or massive task suites, the transition matrix can become large, potentially requiring sparse‑matrix or hierarchical modeling techniques.
- Extension to Continuous Time: The current model is discrete‑time; extending to continuous‑time Markov processes could capture variable‑length actions (e.g., API calls with differing latency).
- Integration with Training Loops: Future work could close the loop by feeding reliability diagnostics back into LLM fine‑tuning or reinforcement‑learning‑from‑human‑feedback pipelines, directly optimizing for a desired first‑passage distribution.
Authors
- Phat T. Tran‑Truong
- Xuan‑Bach Le
Paper Information
- arXiv ID: 2604.24579v1
- Categories: cs.SE
- Published: April 27, 2026