[Paper] An Empirical Study of Bugs in Modern LLM Agent Frameworks
Source: arXiv - 2602.21806v1
Overview
LLM‑powered agents are becoming the backbone of many AI‑driven products, from autonomous assistants to multi‑bot orchestration platforms. This paper shines a light on a hidden source of fragility: bugs inside the agent frameworks (e.g., CrewAI, LangChain) that glue large language models to external tools and to each other. By analyzing almost a thousand real‑world bug reports, the authors build a taxonomy that helps developers understand why these frameworks fail and how the failures manifest throughout an agent’s lifecycle.
Key Contributions
- Large‑scale empirical dataset – 998 bug reports collected from two of the most popular LLM agent frameworks (CrewAI & LangChain).
- Lifecycle‑aware taxonomy – 15 root‑cause categories and 7 observable symptoms mapped to five stages of an agent’s life:
- Agent Initialization
- Perception
- Self‑Action
- Mutual Interaction
- Evolution
- Root‑cause concentration – Identifies API misuse, API incompatibility, and documentation desynchronization as the three dominant sources of bugs, especially during the Self‑Action stage.
- Symptom profile – Shows that most bugs surface as Functional Errors, Crashes, or Build Failures, directly breaking task flow or control logic.
- Actionable insights – Provides concrete recommendations for framework maintainers, library authors, and downstream developers to reduce bug incidence.
Methodology
- Data collection – The authors mined issue trackers, GitHub Discussions, and community forums of CrewAI and LangChain, filtering for reports that were reproducible, had clear descriptions, and were tied to framework code (not the LLM itself).
- Manual annotation – A team of three researchers independently labeled each bug with:
- The lifecycle stage where the problem first appeared.
- The root cause (e.g., wrong API usage, version mismatch).
- The observable symptom (e.g., exception, wrong output).
Disagreements were resolved through discussion, achieving a Cohen’s κ of 0.82 (high inter‑rater reliability).
- Taxonomy construction – Using grounded theory, the authors iteratively grouped similar labels, resulting in the final 15‑cause / 7‑symptom schema.
- Quantitative analysis – Frequency counts and cross‑tabulations highlighted which causes dominate which stages, and which symptoms are most common overall.
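The reported Cohen's κ of 0.82 quantifies how much the annotators agreed beyond chance. As a minimal sketch (not the authors' code), the statistic can be computed directly from two annotators' label lists:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each rater's marginal label frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

A κ of 0.82 falls in the range conventionally read as "almost perfect" agreement, supporting the reliability of the 15‑cause / 7‑symptom labels.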
Results & Findings
| Lifecycle Stage | Top Root Causes (frequency) | Typical Symptoms |
|---|---|---|
| Agent Initialization | Documentation desync, API incompatibility | Build Failure, Import Error |
| Perception | API misuse, Missing validation | Functional Error, Wrong Data |
| Self‑Action (largest cluster) | API misuse, API incompatibility, Documentation desync | Functional Error, Crash, Unexpected Return |
| Mutual Interaction | Concurrency bugs, State leakage | Deadlock, Inconsistent Output |
| Evolution | Version drift, Configuration drift | Build Failure, Regression |
- API misuse accounts for ~38 % of all bugs, often because developers call a framework method with arguments that the underlying LLM or tool does not support.
- API incompatibility (e.g., mismatched library versions) contributes ~22 %, highlighting the fragile dependency graph of modern agent stacks.
- Documentation desync (out‑of‑date docs vs. code) explains ~15 % of failures, underscoring the need for tighter doc‑code coupling.
- The majority of bugs (≈70 %) surface as functional errors that silently produce wrong results, which is especially dangerous for production agents that rely on correctness rather than crash detection.
Practical Implications
- For framework maintainers:
  - Implement stricter type‑checking and runtime validation layers to catch API misuse early.
  - Automate compatibility testing across major LLM SDK versions (OpenAI, Anthropic, etc.).
  - Adopt "doc‑as‑code" pipelines (e.g., MkDocs with live code snippets) to keep documentation in sync.
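One way to realize the runtime‑validation recommendation is a decorator that checks call arguments against type annotations before the framework method runs. This is a hypothetical sketch (`validate_call` and `run_tool` are illustrative names, not CrewAI or LangChain APIs):

```python
import functools
import inspect

def validate_call(fn):
    """Reject calls whose arguments don't match the declared annotations,
    surfacing API misuse at the call site instead of deep inside the stack."""
    sig = inspect.signature(fn)

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)  # raises TypeError on bad arity/names
        for name, value in bound.arguments.items():
            ann = sig.parameters[name].annotation
            if ann is not inspect.Parameter.empty and not isinstance(value, ann):
                raise TypeError(
                    f"{name} expects {ann.__name__}, got {type(value).__name__}"
                )
        return fn(*args, **kwargs)
    return wrapper

@validate_call
def run_tool(tool_name: str, timeout: int) -> str:
    # Stand-in for a framework entry point that dispatches a tool call.
    return f"ran {tool_name} with timeout={timeout}"
```

Catching a mistyped `timeout="5"` here yields an immediate, well‑located `TypeError` rather than the downstream functional error the study found to dominate the Self‑Action stage.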
- For developers building agents:
  - Treat framework APIs as first‑class contracts: write unit tests that verify expected signatures and return shapes.
  - Pin dependency versions and use tools like `pipdeptree` to detect transitive incompatibilities.
  - Leverage the taxonomy as a checklist when debugging: identify the lifecycle stage, then narrow down likely root causes.
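The "APIs as contracts" advice can be sketched as a unit test that pins the parameter list an agent relies on, so a dependency upgrade that silently changes a signature fails the build. Everything here is illustrative (`assert_signature` and `kickoff` are hypothetical names, not a real framework API):

```python
import inspect

def assert_signature(fn, expected_params):
    """Fail fast if a dependency's public API drifts from the parameter
    names our agent code was written against."""
    actual = list(inspect.signature(fn).parameters)
    if actual != expected_params:
        raise AssertionError(f"API drift: expected {expected_params}, got {actual}")

# Stand-in for a framework entry point the agent calls:
def kickoff(inputs, callbacks=None):
    pass

# Runs in the test suite; breaks loudly if an upgrade renames or
# reorders parameters instead of producing a runtime functional error.
assert_signature(kickoff, ["inputs", "callbacks"])
```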
- For CI/CD pipelines:
  - Add smoke‑tests that simulate the Self‑Action stage (the hot spot) with representative prompts and tool calls.
  - Flag "functional error" patterns (e.g., unexpected JSON structures) as test failures rather than letting them slip into production.
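A minimal smoke‑test helper for the "unexpected JSON structure" case might validate the shape of an agent's tool‑call payload and treat any deviation as a hard failure. The required keys below are an assumed example schema, not something defined by the paper or either framework:

```python
import json

# Hypothetical expected shape of a tool-call payload.
REQUIRED_KEYS = {"tool", "args", "status"}

def check_agent_output(raw: str) -> dict:
    """Turn a malformed or incomplete payload into a test failure instead
    of letting it flow downstream as a silent functional error."""
    payload = json.loads(raw)  # raises ValueError on non-JSON output
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"unexpected JSON structure, missing: {sorted(missing)}")
    return payload
```

Wired into CI against recorded Self‑Action traces, this converts the study's most dangerous symptom class (silently wrong output) into a visible red build.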
- For product managers:
  - Allocate budget for "framework health" monitoring: track bug density per lifecycle stage as a leading indicator of technical debt.
Overall, the study equips the community with a roadmap for more reliable LLM agent deployments, moving the focus from model hallucinations to the sturdier, but often overlooked, plumbing that makes agents work.
Limitations & Future Work
- Scope limited to two frameworks (CrewAI, LangChain); while they are popular, findings may not fully generalize to niche or emerging agent libraries.
- Bug reports are community‑submitted, which can bias the dataset toward more visible or “noisy” failures; silent bugs that never surface in issue trackers remain unexamined.
- The taxonomy is static; as new LLM APIs (e.g., function calling, tool‑use extensions) evolve, additional root‑cause categories may emerge.
Future research directions suggested by the authors include: expanding the study to a broader ecosystem of agent frameworks, automating taxonomy extraction via NLP, and building diagnostic tooling that leverages the taxonomy to provide real‑time suggestions during development.
Authors
- Xinxue Zhu
- Jiacong Wu
- Xiaoyu Zhang
- Tianlin Li
- Yanzhou Mu
- Juan Zhai
- Chao Shen
- Yang Liu
Paper Information
- arXiv ID: 2602.21806v1
- Categories: cs.SE
- Published: February 25, 2026