[Paper] From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work

Published: May 7, 2026 at 10:39 AM EDT

Source: arXiv - 2605.06365v1

Overview

Large language model (LLM) agents are increasingly used as autonomous “workers” that reason, call tools, store memory, and iteratively refine their outputs. These loops can produce impressive answers, but the implicit conversational state they rely on makes it hard to keep work reproducible, isolate unrelated changes, or reliably propagate updates. The paper “From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI‑Native Work” proposes a new execution model, execution lineage, that represents an agent’s entire workflow as a directed acyclic graph (DAG) of artifact‑producing steps, each with explicit dependencies and identity‑based replay. On the tasks evaluated, the authors show that this graph‑based approach yields more stable and maintainable results than traditional loop‑centric updates.

Key Contributions

  • Execution Lineage Model: Formalizes AI‑native work as a DAG of deterministic computations, exposing explicit data dependencies and stable intermediate artifacts.
  • Identity‑Based Replay: Introduces a replay mechanism that re‑executes only the affected nodes when a change occurs, preserving unrelated work unchanged.
  • Empirical Evaluation: Benchmarks DAG replay against two loop‑centric baselines on controlled policy‑memo update tasks, demonstrating zero churn and perfect upstream/downstream preservation.
  • State‑Quality vs. Answer‑Quality Insight: Shows that high‑quality final answers can mask hidden inconsistencies in the underlying state, which DAG replay eliminates.
  • Practical Blueprint: Provides design patterns and implementation hints for integrating execution lineage into existing LLM‑agent frameworks.
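One way to read "identity‑based replay" is that each node gets an identifier derived from what it computes and what it consumes, so an unchanged node keeps its identifier and its cached artifact can be reused. The sketch below is an assumption about how such an identifier might be built, not the paper's definition; `node_identity`, the `step` and `params` fields, the step names, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
import json


def node_identity(step: str, params: dict, input_ids: list[str]) -> str:
    """Hypothetical identity: a hash of the step name, its parameters,
    and the identities of its inputs (input order matters here)."""
    payload = json.dumps(
        {"step": step, "params": params, "inputs": input_ids},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two independent branches of a memo workflow (names are illustrative only).
facts_id = node_identity("gather_facts", {"topic": "remote work"}, [])
draft_id = node_identity("draft_memo", {"tone": "formal"}, [facts_id])
faq_id = node_identity("draft_faq", {"audience": "staff"}, [])

# Editing the upstream step changes its identity, and therefore the
# identity of everything that consumes it.
edited_facts_id = node_identity("gather_facts", {"topic": "hybrid work"}, [])
edited_draft_id = node_identity("draft_memo", {"tone": "formal"}, [edited_facts_id])

# The unrelated branch hashes to the same identity as before.
faq_id_after_edit = node_identity("draft_faq", {"audience": "staff"}, [])
```

Under this assumption, a cache keyed by identity would miss for the edited draft and hit for the FAQ branch, which is the behavior the replay contribution describes.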

Methodology

  1. Workflow Graph Construction – The authors instrument a typical LLM‑agent loop (reason → tool → memory → refine) to emit artifact nodes (e.g., a generated policy draft, a tool‑call result). Each node records a unique identifier and a list of input identifiers, forming a DAG.
  2. Deterministic Execution – Nodes are executed in a pure, side‑effect‑free manner; any nondeterminism (e.g., temperature sampling) is either fixed or captured as part of the node’s state.
  3. Replay Engine – When a user edits an intermediate artifact (e.g., adds a new constraint), the engine recomputes only the downstream nodes that depend on the edited node, re‑using cached results for all others.
  4. Baseline Comparisons – Two loop‑centric baselines are implemented: (a) full regeneration (re‑run the entire agent from scratch) and (b) partial regeneration (re‑run from the point of edit but without explicit dependency tracking).
  5. Metrics – The study measures churn (how many artifacts change unintentionally), contamination (import of unrelated context), and cross‑artifact consistency (whether related artifacts stay aligned after an edit).
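Steps 1 to 3 above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Graph` layout, the function names `run`, `downstream`, and `replay_after_edit`, and the toy memo workflow are all hypothetical, and plain string-building lambdas stand in for LLM and tool calls.

```python
from graphlib import TopologicalSorter
from typing import Callable

# Each node: name -> (input node names, pure function of the input artifacts).
Graph = dict[str, tuple[list[str], Callable[..., str]]]


def run(graph: Graph, cache: dict[str, str], dirty: set[str]) -> list[str]:
    """Execute nodes in dependency order, recomputing only those in `dirty`
    or missing from `cache`. Returns the names that were actually executed."""
    order = TopologicalSorter({n: deps for n, (deps, _) in graph.items()}).static_order()
    executed = []
    for name in order:
        deps, fn = graph[name]
        if name in dirty or name not in cache:
            cache[name] = fn(*(cache[d] for d in deps))
            executed.append(name)
    return executed


def downstream(graph: Graph, edited: str) -> set[str]:
    """All nodes that transitively depend on `edited` (excluding `edited`)."""
    affected: set[str] = set()
    frontier = [edited]
    while frontier:
        current = frontier.pop()
        for name, (deps, _) in graph.items():
            if current in deps and name not in affected:
                affected.add(name)
                frontier.append(name)
    return affected


def replay_after_edit(graph: Graph, cache: dict[str, str],
                      edited: str, new_artifact: str) -> list[str]:
    """Replace one artifact by hand, then recompute only its dependents."""
    cache[edited] = new_artifact
    return run(graph, cache, dirty=downstream(graph, edited))


# Toy workflow: the FAQ branch does not depend on the constraints node.
graph: Graph = {
    "facts": ([], lambda: "facts-v1"),
    "constraints": ([], lambda: "constraints-v1"),
    "draft": (["facts", "constraints"], lambda f, c: f"draft({f},{c})"),
    "memo": (["draft"], lambda d: f"memo({d})"),
    "faq": (["facts"], lambda f: f"faq({f})"),
}
cache: dict[str, str] = {}
first_pass = run(graph, cache, dirty=set())          # executes all five nodes
rerun = replay_after_edit(graph, cache, "constraints", "constraints-v2")
# rerun == ["draft", "memo"]; "facts" and "faq" are served from the cache
```

The partial-regeneration baseline in step 4 differs precisely in that it has no `downstream` set to consult, so it cannot tell which artifacts are safe to leave alone.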

Results & Findings

| Scenario | DAG Replay | Full Regeneration | Partial Regeneration |
|---|---|---|---|
| Unrelated‑branch update (edit a memo unrelated to current branch) | 0% churn, 0% contamination – final memo unchanged | 78% of runs imported unrelated context | 45% of runs imported unrelated context |
| Intermediate‑artifact edit (add a new policy constraint) | All downstream artifacts updated exactly, upstream artifacts unchanged, perfect consistency | Updated final memo but also altered unrelated upstream artifacts | Updated final memo but introduced occasional mismatches between related artifacts |
| Overall answer quality | Comparable to baselines on the first pass; superior on subsequent revisions due to stable state | Slightly higher on first pass when all context fits in prompt | Similar to DAG on first pass, degrades on later revisions |

Takeaway: DAG‑based execution lineage guarantees that only the intended parts of a workflow change, eliminating hidden state drift that can accumulate over iterative revisions. While strong loop baselines can still produce polished final outputs for single‑shot tasks, they lack the reproducibility guarantees that matter for long‑running AI‑native projects.
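To make the churn metric concrete, the sketch below compares two artifact snapshots and counts changes outside the set that was supposed to change. The exact formula (changed unrelated artifacts divided by all unrelated artifacts) is an assumption, since the summary only describes churn informally, and the snapshot contents are made-up illustrations, not data from the paper.

```python
def churn(before: dict[str, str], after: dict[str, str], intended: set[str]) -> float:
    """Fraction of artifacts outside `intended` whose content differs
    between the two snapshots. This exact formula is an assumption."""
    unrelated = [name for name in before if name not in intended]
    if not unrelated:
        return 0.0
    changed = sum(1 for name in unrelated if after.get(name) != before[name])
    return changed / len(unrelated)


# Illustrative snapshots only.
before = {"facts": "f1", "draft": "d1", "memo": "m1", "faq": "q1"}
# An edit to the draft is expected to change the draft and the memo only.
intended = {"draft", "memo"}

after_replay = {"facts": "f1", "draft": "d2", "memo": "m2", "faq": "q1"}
after_regen = {"facts": "f1 (reworded)", "draft": "d2", "memo": "m2", "faq": "q1"}

replay_churn = churn(before, after_replay, intended)  # 0.0: nothing unrelated moved
regen_churn = churn(before, after_regen, intended)    # 0.5: "facts" was reworded
```

A final memo can look equally good in both cases, which is the state-quality versus answer-quality gap the paper highlights: the difference only shows up when the unrelated artifacts are compared.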

Practical Implications

  • Version‑Controlled AI Workflows: Developers can treat each artifact as a commit in a version‑control system, enabling diffing, rollbacks, and collaborative editing of LLM‑generated content.
  • Tool Integration Pipelines: When LLM agents orchestrate external APIs (e.g., code generation → compilation → testing), execution lineage ensures that a change to the test suite only re‑runs the testing step and reuses the cached generation and compilation results, saving compute and reducing latency.
  • Regulatory & Auditing Needs: Industries that require traceability (finance, healthcare, legal) can attach a deterministic provenance graph to every AI‑produced decision, which supports compliance audits.
  • Continuous Improvement Loops: Teams can safely experiment with new prompts, model upgrades, or constraint additions without fearing that unrelated artifacts will be unintentionally altered.
  • Debugging & Explainability: The DAG makes it much easier to pinpoint which node introduced a bug or an undesirable bias, because each output is tied to a specific input set and model invocation.
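The auditing and debugging points above both come down to walking the graph backwards from an output. The following sketch shows one way to do that, assuming the lineage is available as a mapping from each node to its inputs; the `provenance` function and the node names (borrowed from the code generation → compilation → testing example) are hypothetical.

```python
def provenance(inputs_of: dict[str, list[str]], target: str) -> list[str]:
    """Return every node that `target` transitively depends on,
    nearest ancestors first (breadth-first order)."""
    seen: list[str] = []
    queue = list(inputs_of.get(target, []))
    while queue:
        node = queue.pop(0)
        if node not in seen:
            seen.append(node)
            queue.extend(inputs_of.get(node, []))
    return seen


# Hypothetical lineage for a code generation -> compilation -> testing pipeline.
inputs_of = {
    "spec": [],
    "test_suite": [],
    "generated_code": ["spec"],
    "binary": ["generated_code"],
    "test_report": ["binary", "test_suite"],
    "release_notes": ["spec"],
}

trace = provenance(inputs_of, "test_report")
# trace == ["binary", "test_suite", "generated_code", "spec"]
```

If the test report looks wrong, only the four nodes in `trace` need inspecting; `release_notes` is not an ancestor and can be ruled out without reading it.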

Limitations & Future Work

  • Determinism Assumption: The approach relies on fixing randomness (e.g., temperature = 0) or capturing stochastic seeds; truly nondeterministic models may still produce divergent artifacts.
  • Scalability of Graph Size: Very large agent workflows could generate massive DAGs; the paper notes the need for pruning, summarization, or hierarchical graph abstractions.
  • Integration Overhead: Existing LLM‑agent frameworks require substantial instrumentation to emit artifact nodes and manage identifiers, which may be a barrier for rapid prototyping.
  • Generalization Beyond Policy‑Memo Tasks: The evaluation focuses on controlled policy‑memo updates; broader domains (e.g., multi‑modal generation, long‑form writing) remain to be tested.
  • Future Directions: The authors suggest exploring hybrid models that combine DAG lineage with selective loop execution, automatic dependency inference, and tooling for visualizing execution graphs in IDEs.
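The determinism assumption in the first bullet can be illustrated with a node that records its seed as part of its own state. This is a sketch under assumptions: `SampledNode` and its fields are hypothetical, and a local `random.Random` generator stands in for a sampling step. It does not make a hosted model deterministic, which is the residual risk the bullet describes.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SampledNode:
    """Hypothetical node whose stochastic choice is pinned by a recorded seed."""
    name: str
    seed: int

    def execute(self, candidates: list[str]) -> str:
        # A private generator seeded from the node's state, so replay does
        # not depend on (or disturb) the global random state.
        rng = random.Random(self.seed)
        return rng.choice(candidates)


candidates = ["phrasing A", "phrasing B", "phrasing C"]
node = SampledNode(name="pick_phrasing", seed=1234)

first = node.execute(candidates)
replayed = node.execute(candidates)  # same seed and inputs, so the same choice
```

Because the seed is a field of the frozen dataclass, changing it produces a different node value, so a lineage system could treat a reseeded step as a new node and not as a silent variation of the old one.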

Authors

  • Josh Rosen
  • Seth Rosen

Paper Information

  • arXiv ID: 2605.06365v1
  • Categories: cs.AI, cs.MA, cs.SE
  • Published: May 7, 2026