[Paper] Measuring the Unmeasurable: Markov Chain Reliability for LLM Agents

Published: April 27, 2026 at 11:05 AM EDT
4 min read
Source: arXiv

Overview

The paper introduces TraceToChain, a reproducible pipeline that turns execution traces of large‑language‑model (LLM) agents into an absorbing discrete‑time Markov chain (DTMC). By doing so, it unifies disparate reliability metrics (e.g., pass@k, reliability decay curves) into a single, statistically‑grounded “success‑time” distribution, while also providing diagnostics and uncertainty estimates that are missing from current benchmark reporting.

Key Contributions

  • Trace‑to‑DTMC pipeline that automatically clusters trace states, estimates transition probabilities with Laplace‑smoothed MLE, and fits an absorbing DTMC to LLM agent behavior.
  • Statistical diagnostics: composite Akaike Information Criterion (AIC) and Kolmogorov–Smirnov (KS) goodness‑of‑fit tests to verify that the chain faithfully represents the observed traces.
  • Uncertainty quantification: Dirichlet‑posterior credible intervals and non‑parametric bootstrap intervals for every transition probability.
  • Unified reliability view: Demonstrates that common metrics (pass@k, pass^k, reliability decay curve) are merely projections of a single first‑passage time distribution derived from the DTMC.
  • Empirical validation: On seven controlled MAST‑style frameworks, the fitted DTMC reproduces held‑out reliability curves with a maximum L∞ error of 0.053 and passes KS tests (p > 0.05) on all frameworks.

Methodology

  1. Trace Collection – Run an LLM agent on a suite of tasks and record every intermediate state (e.g., tool calls, prompts, responses).
  2. Automatic Clustering – Group similar states into a taxonomy of “macro‑states” using a data‑driven clustering algorithm, reducing the size of the state space while preserving the semantics of each trace.
  3. Transition Estimation – Count how often the agent moves from one macro‑state to another. Apply Laplace smoothing to avoid zero‑probability edges, then compute maximum‑likelihood estimates for the transition matrix Q (transient‑to‑transient) and the absorbing matrices R₊ (to success) and R₋ (to failure).
  4. Model Fit Checks
    • AIC evaluates model parsimony vs. fit.
    • KS test compares the empirical first‑passage time CDF (when the trace first hits an absorbing state) with the analytic CDF derived from the DTMC.
  5. Uncertainty Reporting – Treat transition counts as draws from a Dirichlet distribution to obtain credible intervals; additionally, bootstrap the entire trace set to produce non‑parametric confidence bands.
  6. Reliability Extraction – Use classical reliability formulas (Kemeny–Snell, Goel–Okumoto, etc.) on the fitted DTMC to compute pass@k, pass^k, and the reliability decay curve as closed‑form functions of the first‑passage distribution.
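Steps 3 and 6 above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the macro‑state names, the transition counts, and the smoothing constant are all invented for the example. It estimates a Laplace‑smoothed transition matrix, splits it into the transient block Q and the absorbing columns, and derives the ultimate success probability and the first‑passage‑to‑success CDF:

```python
import numpy as np

# Hypothetical macro-states: 0 = "plan", 1 = "tool_call" (transient);
# absorbing outcomes: success, failure. All counts are illustrative.
counts = np.array([
    # to: plan  tool  succ  fail
    [      4,   10,    1,    1],   # from "plan"
    [      6,    3,    8,    2],   # from "tool_call"
], dtype=float)

alpha = 1.0                                   # Laplace smoothing pseudo-count
smoothed = counts + alpha
P = smoothed / smoothed.sum(axis=1, keepdims=True)   # row-stochastic MLE

Q = P[:, :2]          # transient -> transient block
R_succ = P[:, 2:3]    # transient -> success column
R_fail = P[:, 3:4]    # transient -> failure column

# Ultimate success probability from each start state: b = (I - Q)^{-1} R_succ
b = np.linalg.solve(np.eye(2) - Q, R_succ)

def success_cdf(Q, R_succ, start=0, horizon=50):
    """First-passage-to-success CDF: F(t) = sum_{s<t} pi0 Q^s R_succ."""
    pi = np.zeros(Q.shape[0])
    pi[start] = 1.0                           # start in "plan"
    absorbed, cdf = 0.0, []
    for _ in range(horizon):
        absorbed += float(pi @ R_succ)        # mass absorbed at this step
        cdf.append(absorbed)
        pi = pi @ Q                           # propagate transient mass
    return np.array(cdf)

F = success_cdf(Q, R_succ)
```

As the horizon grows, `F` converges to the ultimate success probability `b` from the start state, which is the closed‑form link the paper exploits between the transition matrix and the reliability metrics.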

Results & Findings

  • Fit Quality: Across all seven test frameworks, the analytic reliability decay curves derived from the DTMC overlay the empirical curves with a median L∞ error of 0.048, indicating a tight match.
  • Statistical Acceptance: Two‑sample KS tests on the first‑passage CDFs never reject the fitted model (p‑values ranging from 0.78 to 1.0).
  • Uncertainty Tightness: Posterior and bootstrap intervals for each transition probability agree within ~0.01 at the median, showing that the pipeline yields stable estimates even with modest trace data.
  • Metric Unification: The authors demonstrate mathematically that pass@k, pass^k, and the reliability decay curve are all marginalizations of the same underlying DTMC‑derived distribution, simplifying the interpretation of benchmark results.
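The unification result can be made concrete with a small sketch. Assuming independent runs and a per‑run success probability p obtained by absorbing the fitted DTMC (the value 0.73 below is illustrative, not a number from the paper), pass@k and pass^k are simple functions of p:

```python
import numpy as np

# Assumed per-run success probability, e.g. the start-state entry of
# (I - Q)^{-1} R_plus from a fitted absorbing DTMC. Illustrative value.
p = 0.73

def pass_at_k(p, k):
    """Probability that at least one of k independent runs succeeds."""
    return 1.0 - (1.0 - p) ** k

def pass_pow_k(p, k):
    """Probability that all k independent runs succeed."""
    return p ** k

ks = np.arange(1, 11)
at_k = pass_at_k(p, ks)      # saturates toward 1 as k grows
pow_k = pass_pow_k(p, ks)    # decays geometrically as k grows
```

Both curves are marginalizations of the same first‑passage distribution, which is why reporting the distribution itself (with uncertainty bands) subsumes the individual scalar metrics.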

Practical Implications

  • More Trustworthy Benchmarks: Developers can now accompany scalar scores (e.g., pass@10 = 0.73) with a full success‑time distribution and confidence bounds, making it easier to compare agents under different latency or step‑budget constraints.
  • Debugging & Optimization: The macro‑state transition matrix highlights “bottleneck” states (high probability of looping or failure), guiding targeted prompt engineering or tool‑integration improvements.
  • Service‑Level Agreements (SLAs): Cloud providers offering LLM‑powered agents can use the DTMC model to predict the probability of task completion within a given time budget, enabling more precise SLA definitions.
  • Automated Monitoring: By continuously feeding new traces into TraceToChain, production systems can detect drift (e.g., a sudden increase in transition to failure states) before it manifests as user‑visible errors.
  • Cross‑Task Generalization: Because the pipeline is data‑driven, it can be applied to any sequential LLM workflow—code generation, autonomous web‑browsing, multi‑turn reasoning—without hand‑crafting task‑specific reliability formulas.
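A minimal sketch of the monitoring idea, under assumptions not spelled out in the paper: compare a baseline transition matrix against one re‑estimated from recent production traces, row by row, using total variation distance, and alert when any state's distribution drifts past a threshold. Matrices and the threshold are invented for illustration:

```python
import numpy as np

def row_tv_distance(P_base, P_new):
    """Per-state total variation distance between two transition matrices."""
    return 0.5 * np.abs(P_base - P_new).sum(axis=1)

# Baseline fit (columns: plan, tool_call, success, failure); illustrative.
P_base = np.array([[0.25, 0.55, 0.10, 0.10],
                   [0.30, 0.18, 0.39, 0.13]])
# Re-estimate from recent traces: more mass now flows to the failure column.
P_new  = np.array([[0.22, 0.50, 0.08, 0.20],
                   [0.28, 0.17, 0.30, 0.25]])

drift = row_tv_distance(P_base, P_new)
alarms = drift > 0.05        # per-state alert threshold (tunable)
```

In practice one would compare against the Dirichlet/bootstrap intervals the pipeline already produces rather than a fixed threshold, so that alarms account for estimation noise in the recent window.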

Limitations & Future Work

  • Controlled Benchmarks: Validation was performed on synthetic MAST‑style frameworks; real‑world, noisy environments may introduce state explosion or non‑Markovian dependencies that challenge the DTMC assumption.
  • State Clustering Sensitivity: The quality of the macro‑state taxonomy depends on the clustering algorithm and hyper‑parameters; poor clustering could obscure important failure modes.
  • Scalability: For extremely long traces or massive task suites, the transition matrix can become large, potentially requiring sparse‑matrix or hierarchical modeling techniques.
  • Extension to Continuous Time: The current model is discrete‑time; extending to continuous‑time Markov processes could capture variable‑length actions (e.g., API calls with differing latency).
  • Integration with Training Loops: Future work could close the loop by feeding reliability diagnostics back into LLM fine‑tuning or reinforcement‑learning‑from‑human‑feedback pipelines, directly optimizing for a desired first‑passage distribution.

Authors

  • Phat T. Tran‑Truong
  • Xuan‑Bach Le

Paper Information

  • arXiv ID: 2604.24579v1
  • Categories: cs.SE
  • Published: April 27, 2026