[Paper] Measuring the Unmeasurable: Markov Chain Reliability for LLM Agents
Source: arXiv - 2604.24579v1
Overview
The paper introduces TraceToChain, a reproducible pipeline that turns execution traces of large‑language‑model (LLM) agents into an absorbing discrete‑time Markov chain (DTMC). By doing so, it unifies disparate reliability metrics (e.g., pass@k, reliability decay curves) into a single, statistically‑grounded “success‑time” distribution, while also providing diagnostics and uncertainty estimates that are missing from current benchmark reporting.
Key Contributions
- Trace‑to‑DTMC pipeline that automatically clusters trace states, estimates transition probabilities with Laplace‑smoothed MLE, and fits an absorbing DTMC to LLM agent behavior.
- Statistical diagnostics: composite Akaike Information Criterion (AIC) and Kolmogorov–Smirnov (KS) goodness‑of‑fit tests to verify that the chain faithfully represents the observed traces.
- Uncertainty quantification: Dirichlet‑posterior credible intervals and non‑parametric bootstrap intervals for every transition probability.
- Unified reliability view: Demonstrates that common metrics (pass@k, pass^k, reliability decay curve) are merely projections of a single first‑passage time distribution derived from the DTMC.
- Empirical validation: On seven controlled MAST‑style frameworks, the fitted DTMC reproduces held‑out reliability curves with a maximum L∞ error of 0.053 and passes KS tests (p > 0.05) on all frameworks.
Methodology
- Trace Collection – Run an LLM agent on a suite of tasks and record every intermediate state (e.g., tool calls, prompts, responses).
- Automatic Clustering – Group similar states into a taxonomy of “macro‑states” using a data‑driven clustering algorithm, shrinking the state space while preserving semantics.
- Transition Estimation – Count how often the agent moves from one macro‑state to another. Apply Laplace smoothing to avoid zero‑probability edges, then compute maximum‑likelihood estimates for the transition matrix Q (transient‑to‑transient) and the absorbing matrices R₊ (to success) and R₋ (to failure).
- Model Fit Checks –
  - AIC evaluates model parsimony vs. fit.
  - KS test compares the empirical first‑passage time CDF (when the trace first hits an absorbing state) with the analytic CDF derived from the DTMC.
- Uncertainty Reporting – Treat transition counts as draws from a Dirichlet distribution to obtain credible intervals; additionally, bootstrap the entire trace set to produce non‑parametric confidence bands.
- Reliability Extraction – Use classical reliability formulas (Kemeny–Snell, Goel–Okumoto, etc.) on the fitted DTMC to compute pass@k, pass^k, and the reliability decay curve as closed‑form functions of the first‑passage distribution.
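The uncertainty-reporting step maps naturally onto conjugate Bayesian updating: with a Dirichlet prior, each row's posterior over outgoing transition probabilities is again Dirichlet, so credible intervals come straight from posterior sampling. A minimal sketch, assuming made-up transition counts and a uniform (add-one) prior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical transition counts out of one macro-state to 3 successors
# (illustrative numbers, not from the paper).
counts = np.array([42.0, 7.0, 3.0])

# Dirichlet posterior under a uniform prior: alpha_j = counts_j + 1.
alpha = counts + 1.0
samples = rng.dirichlet(alpha, size=10_000)

# 95% equal-tailed credible interval for each transition probability.
lo, hi = np.percentile(samples, [2.5, 97.5], axis=0)
for j in range(len(counts)):
    print(f"P(-> state {j}): [{lo[j]:.3f}, {hi[j]:.3f}]")
```

The non-parametric alternative described in the paper resamples whole traces with replacement, re-runs the counting and estimation, and reads intervals off the empirical distribution of the re-estimated probabilities; resampling traces rather than individual transitions preserves within-trace dependence.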
Results & Findings
- Fit Quality: Across all seven test frameworks, the analytic reliability decay curves derived from the DTMC overlay the empirical curves with a median L∞ error of 0.048, indicating a tight match.
- Statistical Acceptance: Two‑sample KS tests on the first‑passage CDFs never reject the fitted model (p‑values ranging from 0.78 to 1.0).
- Uncertainty Tightness: Posterior and bootstrap intervals for each transition probability agree within ~0.01 at the median, showing that the pipeline yields stable estimates even with modest trace data.
- Metric Unification: The authors demonstrate mathematically that pass@k, pass^k, and the reliability decay curve are all marginalizations of the same underlying DTMC‑derived distribution, simplifying the interpretation of benchmark results.
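The unification claim can be illustrated with a toy absorbing DTMC. The matrices and the step budget below are illustrative assumptions, and the pass@k / pass^k formulas assume i.i.d. independent runs with eventual success probability p; one reading of the decay curve is the survival function of the success first-passage time:

```python
import numpy as np

# Illustrative two-transient-state chain (numbers not from the paper).
Q = np.array([[0.2, 0.5],
              [0.1, 0.3]])        # transient-to-transient
r_plus = np.array([0.2, 0.5])     # per-step absorption into success
pi0 = np.array([1.0, 0.0])        # start in state 0

# Eventual success probability via the fundamental matrix N = (I - Q)^-1.
N = np.linalg.inv(np.eye(2) - Q)
p = float(pi0 @ N @ r_plus)

# Scalar metrics as projections of the same chain (i.i.d.-runs assumption):
k = 10
pass_at_k = 1 - (1 - p) ** k      # at least one of k runs succeeds
pass_pow_k = p ** k               # all k runs succeed

# Reliability decay curve: probability of NOT having succeeded by step t.
def decay(t):
    mass, absorbed = pi0.copy(), 0.0
    for _ in range(t):
        absorbed += float(mass @ r_plus)
        mass = mass @ Q
    return 1.0 - absorbed
```

Each metric is computed from the same (Q, r₊, π₀) triple, which is the sense in which they are marginalizations of one first-passage distribution.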
Practical Implications
- More Trustworthy Benchmarks: Developers can now accompany scalar scores (e.g., pass@10 = 0.73) with a full success‑time distribution and confidence bounds, making it easier to compare agents under different latency or step‑budget constraints.
- Debugging & Optimization: The macro‑state transition matrix highlights “bottleneck” states (high probability of looping or failure), guiding targeted prompt engineering or tool‑integration improvements.
- Service‑Level Agreements (SLAs): Cloud providers offering LLM‑powered agents can use the DTMC model to predict the probability of task completion within a given time budget, enabling more precise SLA definitions.
- Automated Monitoring: By continuously feeding new traces into TraceToChain, production systems can detect drift (e.g., a sudden increase in transition to failure states) before it manifests as user‑visible errors.
- Cross‑Task Generalization: Because the pipeline is data‑driven, it can be applied to any sequential LLM workflow—code generation, autonomous web‑browsing, multi‑turn reasoning—without hand‑crafting task‑specific reliability formulas.
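As a concrete example of the monitoring use case, one simple drift check (a sketch under our own assumptions, not a mechanism described in the paper) compares each macro-state's freshly re-estimated outgoing distribution against a baseline via total variation distance and flags rows that moved beyond a threshold:

```python
import numpy as np

def drifted_rows(P_baseline, P_current, threshold=0.1):
    """Indices of states whose outgoing distribution shifted by more than
    `threshold` in total variation distance (threshold is illustrative)."""
    tv = 0.5 * np.abs(P_baseline - P_current).sum(axis=1)
    return np.flatnonzero(tv > threshold)

P_base = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
P_now  = np.array([[0.4, 0.2, 0.4],   # state 0's failure mass jumped
                   [0.1, 0.8, 0.1]])
print(drifted_rows(P_base, P_now))
```

In production the flagged rows would point directly at the macro-states whose behavior changed, which is more actionable than a drop in an aggregate pass@k score.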
Limitations & Future Work
- Controlled Benchmarks: Validation was performed on synthetic MAST‑style frameworks; real‑world, noisy environments may introduce state explosion or non‑Markovian dependencies that challenge the DTMC assumption.
- State Clustering Sensitivity: The quality of the macro‑state taxonomy depends on the clustering algorithm and hyper‑parameters; poor clustering could obscure important failure modes.
- Scalability: For extremely long traces or massive task suites, the transition matrix can become large, potentially requiring sparse‑matrix or hierarchical modeling techniques.
- Extension to Continuous Time: The current model is discrete‑time; extending to continuous‑time Markov processes could capture variable‑length actions (e.g., API calls with differing latency).
- Integration with Training Loops: Future work could close the loop by feeding reliability diagnostics back into LLM fine‑tuning or reinforcement‑learning‑from‑human‑feedback pipelines, directly optimizing for a desired first‑passage distribution.
Authors
- Phat T. Tran‑Truong
- Xuan‑Bach Le
Paper Information
- arXiv ID: 2604.24579v1
- Categories: cs.SE
- Published: April 27, 2026