[Paper] AgentStepper: Interactive Debugging of Software Development Agents
Source: arXiv - 2602.06593v1
Overview
The paper presents AgentStepper, the first interactive debugger designed specifically for software‑development agents powered by large language models (LLMs). By turning the opaque “black‑box” execution of these agents into a step‑wise, inspectable process, the authors make it possible for developers to understand, control, and fix the agents that automate tasks such as environment setup, issue triage, and program repair.
Key Contributions
- Interactive debugging paradigm for LLM agents – treats an agent’s execution as a structured conversation among the LLM, the agent controller, and external tools.
- Breakpoint, step‑over, and live‑edit capabilities – developers can pause execution, examine intermediate states, and modify prompts or tool calls on the fly.
- Repository‑level change tracking – every code edit performed by the agent is captured and visualized, bridging the gap between high‑level intent and low‑level diffs.
- Low integration overhead – adding AgentStepper to three state‑of‑the‑art agents required only 39–42 lines of edited code.
- Empirical validation – a user study (12 participants) shows higher trajectory comprehension, better bug‑finding success, and markedly lower frustration compared with conventional debugging tools.
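The "structured conversation" view of an agent run can be made concrete with a few record types. The following is a minimal sketch, not the paper's actual data model; all class and field names are illustrative:

```python
from dataclasses import dataclass, field
from typing import Union

@dataclass
class LLMMessage:
    """A response generated by the LLM (e.g., a proposed patch or command)."""
    prompt: str
    response: str

@dataclass
class ControllerDecision:
    """The agent program's decision logic, e.g., which tool to invoke next."""
    rationale: str
    chosen_action: str

@dataclass
class ToolCall:
    """An invocation of an external tool such as the file system or compiler."""
    tool: str
    args: dict
    result: str = ""

Step = Union[LLMMessage, ControllerDecision, ToolCall]

@dataclass
class Trajectory:
    """An agent run as an ordered sequence of inspectable steps."""
    steps: list = field(default_factory=list)

    def append(self, step: Step) -> None:
        self.steps.append(step)

# A toy three-step run: LLM response, controller decision, tool invocation.
traj = Trajectory()
traj.append(LLMMessage(prompt="Fix the failing test", response="Edit foo.py"))
traj.append(ControllerDecision(rationale="LLM proposed an edit", chosen_action="edit_file"))
traj.append(ToolCall(tool="file_system", args={"path": "foo.py"}, result="ok"))
print(len(traj.steps))  # 3 recorded steps
```

Because every step is a typed record rather than a raw log line, a debugger can pause on, inspect, or visualize any of the three message kinds uniformly.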
Methodology
- Trajectory Modeling – The authors model an agent’s run as a sequence of messages: (a) the LLM’s generated response, (b) the agent program’s decision logic, and (c) any tool invocations (e.g., file system, compiler). This representation is stored as a directed graph that can be visualized step‑by‑step.
- Debugger Engine – Built on top of a lightweight runtime wrapper that intercepts every LLM query and tool call. The wrapper injects breakpoints (user‑defined or automatic) and exposes an API for stepping, inspecting variables, and editing prompts.
- Live Editing – When a breakpoint hits, developers can edit the prompt sent to the LLM or the arguments of a tool call. The modified message is re‑sent, and the subsequent trajectory is recomputed without restarting the whole agent.
- Evaluation –
  - Integration test: The authors retrofitted AgentStepper into ExecutionAgent, SWE‑Agent, and RepairAgent, measuring lines of code changed and runtime overhead.
  - User study: Twelve participants (software engineers) performed two tasks, understanding a given trajectory and locating a bug in the agent's implementation, using either AgentStepper or a standard console logger. Task performance, success rate, and NASA‑TLX workload scores were collected.
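The breakpoint and live-editing mechanism can be illustrated with a small wrapper around the LLM call site. This is a sketch under assumed semantics, not AgentStepper's actual API; the `DebugWrapper` class, its method names, and the stub LLM are all hypothetical:

```python
from typing import Callable

class DebugWrapper:
    """Intercepts LLM queries, pausing at user-defined breakpoints.

    Illustrative sketch: a breakpoint callback may return an edited
    prompt, which is then re-sent in place of the original message.
    """

    def __init__(self, llm: Callable[[str], str]):
        self._llm = llm
        self._breakpoints = set()   # step indices to pause on
        self._step = 0
        self.trace = []             # recorded (step, prompt, response) tuples
        self.on_break = None        # callback: may return an edited prompt

    def add_breakpoint(self, step_index: int) -> None:
        self._breakpoints.add(step_index)

    def query_llm(self, prompt: str) -> str:
        # Pause before sending: the callback may live-edit the prompt.
        if self._step in self._breakpoints and self.on_break:
            edited = self.on_break(self._step, prompt)
            if edited is not None:
                prompt = edited     # re-send the modified message
        response = self._llm(prompt)
        self.trace.append((self._step, prompt, response))
        self._step += 1
        return response

# Usage with a stub LLM: pause at step 1 and rewrite a malformed prompt.
dbg = DebugWrapper(llm=lambda p: f"echo:{p}")
dbg.add_breakpoint(1)
dbg.on_break = lambda step, prompt: prompt.replace("???", "run the tests")
dbg.query_llm("set up the environment")
out = dbg.query_llm("then ???")
print(out)  # echo:then run the tests
```

Wrapping only the query boundary is what keeps the reported integration cost low: the agent's own control flow is untouched, and the subsequent trajectory is recomputed from the edited message onward.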
Results & Findings
| Metric | AgentStepper | Baseline |
|---|---|---|
| Trajectory comprehension accuracy (mean) | 64 % | 67 % |
| Bug‑identification success | 60 % | 17 % |
| Frustration (TLX, 0–7) | 2.4 | 5.4 |
| Lines of code added to agents | 39–42 | — |
| Runtime overhead | < 10 % on average | — |
The numbers indicate that while raw comprehension scores are comparable, AgentStepper dramatically improves bug‑finding success and more than halves reported frustration. The modest code changes required for integration suggest the approach is practical for existing agents.
Practical Implications
- Faster iteration on agent development – Teams can pinpoint why an LLM generated an unexpected suggestion (e.g., because of a malformed prompt) without re‑running the whole pipeline.
- Higher reliability for production‑grade agents – By exposing intermediate repository changes, developers can enforce code‑review style checks before the agent commits modifications.
- Tool‑chain extensibility – The breakpoint and live‑edit hooks can be adapted to custom tools (e.g., container orchestration, CI pipelines), making AgentStepper a generic “debug layer” for any LLM‑driven automation.
- Education and onboarding – New engineers can learn how sophisticated agents reason by stepping through real executions, lowering the barrier to adopt LLM‑based development assistants.
- Potential integration with IDEs – Because AgentStepper already visualizes code diffs and conversational steps, plugging it into VS Code or JetBrains IDEs could give developers a familiar debugging UI for AI agents.
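The "code‑review style checks" implication above can be sketched as a pre-commit gate over agent-proposed edits. Everything here is hypothetical (the `RepoEdit` record, the check predicates, and `review_gate` are illustrative, not part of AgentStepper):

```python
from dataclasses import dataclass

@dataclass
class RepoEdit:
    """One repository change proposed by the agent."""
    path: str
    diff: str

def review_gate(edits, checks):
    """Run review-style checks on agent edits before they are applied.

    Returns (approved, rejected): an edit is approved only if it
    passes every check predicate.
    """
    approved, rejected = [], []
    for edit in edits:
        if all(check(edit) for check in checks):
            approved.append(edit)
        else:
            rejected.append(edit)
    return approved, rejected

# Example checks: block edits to CI config and overly large diffs.
no_ci_edits = lambda e: not e.path.startswith(".github/")
small_diff = lambda e: e.diff.count("\n") < 200

edits = [
    RepoEdit("src/foo.py", "+fix\n"),
    RepoEdit(".github/workflows/ci.yml", "+skip tests\n"),
]
ok, blocked = review_gate(edits, [no_ci_edits, small_diff])
print([e.path for e in ok])       # ['src/foo.py']
print([e.path for e in blocked])  # ['.github/workflows/ci.yml']
```

Because AgentStepper already surfaces every edit as a visualized diff, such a gate would slot in naturally between the agent proposing a change and the change being committed.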
Limitations & Future Work
- Scalability to long‑running agents – The current prototype stores the full trajectory in memory, which may become prohibitive for agents that run for hours or generate thousands of steps.
- Dependence on deterministic LLM responses – Live editing can lead to nondeterministic behavior when the underlying model uses nonzero temperature or stochastic sampling, complicating reproducibility of a debugged trajectory.
- User study size – Only twelve participants were evaluated; larger, more diverse studies are needed to confirm generalizability across domains (e.g., data‑science notebooks, DevOps scripts).
- Tool ecosystem coverage – The debugger currently supports a limited set of built‑in tools; extending it to arbitrary third‑party APIs will require a standardized instrumentation interface.
Future work could address these points by adding trajectory compression, deterministic replay mechanisms, broader user evaluations, and an open plugin system for tool instrumentation.
Authors
- Robert Hutter
- Michael Pradel
Paper Information
- arXiv ID: 2602.06593v1
- Categories: cs.SE, cs.AI
- Published: February 6, 2026