[Paper] How Do LLMs Fail In Agentic Scenarios? A Qualitative Analysis of Success and Failure Scenarios of Various LLMs in Agentic Simulations

Published: December 8, 2025 at 07:27 AM EST
4 min read
Source: arXiv - 2512.07497v1

Overview

The paper How Do LLMs Fail In Agentic Scenarios? dives into the nitty‑gritty of why large language models (LLMs) stumble when they’re asked to act autonomously—think “AI assistants that can read files, run SQL queries, or manipulate spreadsheets on their own.” By dissecting 900 execution traces across three popular models, the authors expose the hidden patterns that separate a smooth, reliable AI agent from one that constantly trips over its own instructions.

Key Contributions

  • Introduces the Kamiwaza Agentic Merit Index (KAMI) v0.1, a benchmark that records step‑by‑step traces rather than just final scores, enabling fine‑grained failure analysis.
  • Compares three representative LLMs (Granite 4 Small, Llama 4 Maverick, DeepSeek V3.1) across four realistic tool‑use tasks: filesystem navigation, text extraction, CSV analysis, and SQL querying.
  • Identifies four recurring failure archetypes that appear regardless of model size or architecture.
  • Shows that scale alone isn’t enough: a 400 B‑parameter model only marginally outperforms a 32 B‑parameter model on uncertainty‑driven tasks, while reinforcement‑learning fine‑tuning drives most of DeepSeek’s reliability.
  • Proposes concrete evaluation dimensions (interactive grounding, recovery behavior, environment‑aware adaptation) for future agentic benchmarks.

Methodology

  1. Benchmark Design (KAMI v0.1) – The authors built a suite of simulated “agentic” tasks that require multiple tool calls (e.g., opening a file, parsing its content, feeding results into a SQL query). Each trial logs every model decision, tool invocation, and response; a minimal sketch of such a trace record follows this list.
  2. Model Selection – Three open‑source LLMs were chosen to represent a spread of scale and training regimes:
    • Granite 4 Small (≈32 B parameters)
    • Llama 4 Maverick (≈400 B parameters)
    • DeepSeek V3.1 (≈671 B total parameters, Mixture‑of‑Experts, RL‑fine‑tuned)
  3. Task Scenarios – Four domains were covered:
    • Filesystem – locate, read, and modify files.
    • Text Extraction – pull specific snippets from unstructured documents.
    • CSV Analysis – compute aggregates, filter rows, join tables.
    • SQL – generate and run queries against a mock database.
  4. Trace Analysis – Rather than aggregating a single accuracy number, the authors manually inspected each trace, categorizing successes and failures into behavioral patterns.
  5. Failure Archetype Coding – Four failure types were defined a priori and then refined iteratively as new patterns emerged.
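
The paper does not spell out KAMI’s trace format, but the idea of logging every decision, tool invocation, and response can be sketched as follows. This is a minimal illustration in Python; the record and field names (`Trace`, `TraceStep`, `tool_name`, etc.) are assumptions for exposition, not the benchmark’s actual schema.

```python
# Minimal sketch of a KAMI-style step-level trace record.
# All names here are illustrative assumptions, not the benchmark's schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TraceStep:
    step_index: int                    # position of this step within the trial
    model_message: str                 # raw text the model produced at this step
    tool_name: Optional[str] = None    # tool invoked, e.g. "read_file" or "run_sql"
    tool_args: dict = field(default_factory=dict)   # arguments passed to the tool
    tool_result: Optional[str] = None  # what the environment returned

@dataclass
class Trace:
    model: str                         # e.g. "DeepSeek V3.1"
    task: str                          # e.g. "csv_analysis"
    steps: list = field(default_factory=list)       # ordered TraceStep records
    final_answer: Optional[str] = None # the agent's final reported answer
```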

Results & Findings

| Model | Overall Success Rate* | Notable Strengths | Main Weaknesses |
| --- | --- | --- | --- |
| Granite 4 Small | ~58 % | Handles deterministic file reads well | Struggles with ambiguous prompts; frequent “premature action” |
| Llama 4 Maverick | ~62 % | Slightly better at uncertainty handling | Still prone to “over‑helpfulness” and context pollution |
| DeepSeek V3.1 | ~78 % | Robust recovery, fewer distractor errors | Occasionally brittle under heavy tool‑call load |

*Success = completing the task within the allowed number of tool calls and producing a correct final answer.
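
As a rough illustration of that criterion (not the paper’s actual grader), a trace could be scored like this, reusing the hypothetical `Trace` record sketched in the Methodology section and assuming a per‑task tool‑call budget and expected answer:

```python
def is_success(trace: Trace, expected_answer: str, max_tool_calls: int) -> bool:
    """Hypothetical scorer: success = staying within the tool-call budget AND
    producing a correct final answer."""
    tool_calls = sum(1 for step in trace.steps if step.tool_name is not None)
    if tool_calls > max_tool_calls or trace.final_answer is None:
        return False
    # Naive exact-match check; numbers, whitespace, and formatting differences
    # would need task-specific normalization in practice.
    return trace.final_answer.strip() == expected_answer.strip()
```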

Four failure archetypes uncovered

  1. Premature Action Without Grounding – The model issues a tool call before it has verified the necessary context (e.g., querying a DB before confirming the table name exists); a detection sketch follows this list.
  2. Over‑Helpfulness – The agent fabricates missing entities (e.g., inventing a column name) to keep the conversation flowing, leading to silent logical errors.
  3. Distractor‑Induced Context Pollution – Irrelevant information in the prompt or previous steps contaminates the model’s reasoning, causing it to chase dead‑ends.
  4. Fragile Execution Under Load – When the number of required tool calls exceeds a modest threshold, the model’s internal state degrades, resulting in dropped calls or malformed commands.
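
The paper’s archetype coding was manual, but the first pattern lends itself to a simple heuristic check over a trace. Below is a sketch, reusing the hypothetical `Trace` record above; the tool names (`run_sql`, `list_tables`, `describe_table`) are placeholders for whatever the harness actually exposes:

```python
def flags_premature_sql(trace: Trace) -> bool:
    """Heuristic for archetype 1: True if the agent ran a SQL query before ever
    inspecting the schema. Tool names are hypothetical placeholders."""
    seen_schema_check = False
    for step in trace.steps:
        if step.tool_name in ("list_tables", "describe_table"):
            seen_schema_check = True
        elif step.tool_name == "run_sql" and not seen_schema_check:
            return True
    return False
```

Similar heuristics could approximate the other archetypes, though over‑helpfulness and context pollution are harder to flag without semantic checks on the tool arguments and results.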

In short, model size did not guarantee resilience: DeepSeek’s RL‑based post‑training gave it a decisive edge, suggesting that targeted fine‑tuning matters more than raw parameter count for agentic reliability.

Practical Implications

  • Enterprise AI Assistants – Companies building internal bots (e.g., for data retrieval or report generation) should prioritize reinforcement‑learning fine‑tuning and explicit verification steps over simply scaling up the base model.
  • Tool‑Use SDKs – SDK designers can embed “guardrails” (e.g., schema validation before a SQL call) to catch premature actions early, reducing the impact of the first failure archetype; see the guardrail sketch after this list.
  • Prompt Engineering – Structuring prompts to isolate relevant context and explicitly request confirmation before tool calls can mitigate over‑helpfulness and distractor pollution.
  • Monitoring & Recovery – Deployments should log full execution traces (as KAMI does) and implement automated rollback or retry mechanisms when a trace shows signs of the “fragile execution” pattern.
  • Benchmarking Standards – The community can adopt KAMI‑style trace‑level evaluation to surface hidden bugs before shipping LLM‑powered agents to production.
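
To make the “guardrails” bullet concrete, here is a minimal sketch of schema validation before a SQL call, written against SQLite rather than any particular SDK; the function name and the regex‑based table extraction are illustrative simplifications:

```python
import re
import sqlite3

def run_sql_with_guardrail(conn: sqlite3.Connection, query: str):
    """Refuse to execute model-generated SQL that references tables missing
    from the live schema, catching 'premature action' before it hits the DB."""
    existing = {row[0].lower() for row in
                conn.execute("SELECT name FROM sqlite_master WHERE type='table'")}
    # Crude extraction of table names after FROM/JOIN; enough for a sketch,
    # not a substitute for a real SQL parser.
    referenced = {name.lower() for name in
                  re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)",
                             query, re.IGNORECASE)}
    missing = referenced - existing
    if missing:
        raise ValueError(f"Query references unknown tables: {sorted(missing)}")
    return conn.execute(query).fetchall()
```

In a deployment, the agent’s tool layer would route model‑generated SQL through a wrapper like this, turning a premature or fabricated query into an explicit, retryable error rather than a silent logical failure.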

Limitations & Future Work

  • Synthetic Environment – The benchmark runs in a controlled simulation; real‑world systems may introduce network latency, permission errors, or richer data modalities that were not tested.
  • Model Diversity – Only three models were examined; extending the study to newer open‑source and closed‑source LLMs (e.g., GPT‑4o, Claude) would validate whether the identified archetypes generalize.
  • Automated Failure Classification – Current analysis relies on manual trace inspection; future work could train a meta‑model to automatically flag the four failure types at scale.
  • User‑In‑the‑Loop Scenarios – The study assumes fully autonomous agents; incorporating intermittent human feedback could reveal additional robustness strategies.

By shining a light on how LLMs stumble rather than just how well they score, this research offers a roadmap for building truly dependable AI agents that can be trusted in everyday developer workflows and enterprise pipelines.

Authors

  • JV Roig

Paper Information

  • arXiv ID: 2512.07497v1
  • Categories: cs.AI, cs.SE
  • Published: December 8, 2025
  • PDF: Download PDF