[Paper] How Do LLMs Fail In Agentic Scenarios? A Qualitative Analysis of Success and Failure Scenarios of Various LLMs in Agentic Simulations

Published: December 8, 2025 at 07:27 AM EST
4 min read
Source: arXiv - 2512.07497v1

Overview

The paper How Do LLMs Fail In Agentic Scenarios? dives into the nitty‑gritty of why large language models (LLMs) stumble when they’re asked to act autonomously—think “AI assistants that can read files, run SQL queries, or manipulate spreadsheets on their own.” By dissecting 900 execution traces across three popular models, the authors expose the hidden patterns that separate a smooth, reliable AI agent from one that constantly trips over its own instructions.

Key Contributions

  • Introduces the Kamiwaza Agentic Merit Index (KAMI) v0.1, a benchmark that records step‑by‑step traces rather than just final scores, enabling fine‑grained failure analysis.
  • Compares three representative LLMs (Granite 4 Small, Llama 4 Maverick, DeepSeek V3.1) across four realistic tool‑use tasks: filesystem navigation, text extraction, CSV analysis, and SQL querying.
  • Identifies four recurring failure archetypes that appear regardless of model size or architecture.
  • Shows that scale alone isn’t enough: a 400 B‑parameter model only marginally outperforms a 32 B‑parameter model on uncertainty‑driven tasks, while reinforcement‑learning fine‑tuning drives most of DeepSeek’s reliability.
  • Proposes concrete evaluation dimensions (interactive grounding, recovery behavior, environment‑aware adaptation) for future agentic benchmarks.

Methodology

  1. Benchmark Design (KAMI v0.1) – The authors built a suite of simulated “agentic” tasks that require multiple tool calls (e.g., opening a file, parsing its content, feeding results into a SQL query). Each trial logs every model decision, tool invocation, and response; a minimal sketch of such a trace record follows this list.
  2. Model Selection – Three open‑source LLMs were chosen to represent a spread of scale and training regimes:
    • Granite 4 Small (≈32 B parameters)
    • Llama 4 Maverick (≈400 B parameters)
    • DeepSeek V3.1 (≈671 B total parameters, Mixture‑of‑Experts, RL‑fine‑tuned)
  3. Task Scenarios – Four domains were covered:
    • Filesystem – locate, read, and modify files.
    • Text Extraction – pull specific snippets from unstructured documents.
    • CSV Analysis – compute aggregates, filter rows, join tables.
    • SQL – generate and run queries against a mock database.
  4. Trace Analysis – Rather than aggregating a single accuracy number, the authors manually inspected each trace, categorizing successes and failures into behavioral patterns.
  5. Failure Archetype Coding – Four failure types were defined a priori and then refined iteratively as new patterns emerged.
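
The paper does not spell out KAMI’s trace format, but the idea of logging every decision, tool invocation, and response can be sketched as follows. This is a minimal illustration in Python; the record and field names (`Trace`, `TraceStep`, `tool_name`, etc.) are assumptions for exposition, not the benchmark’s actual schema.

```python
# Minimal sketch of a KAMI-style step-level trace record.
# All names here are illustrative assumptions, not the benchmark's schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TraceStep:
    step_index: int                    # position of this step within the trial
    model_message: str                 # raw text the model produced at this step
    tool_name: Optional[str] = None    # tool invoked, e.g. "read_file" or "run_sql"
    tool_args: dict = field(default_factory=dict)   # arguments passed to the tool
    tool_result: Optional[str] = None  # what the environment returned

@dataclass
class Trace:
    model: str                         # e.g. "DeepSeek V3.1"
    task: str                          # e.g. "csv_analysis"
    steps: list = field(default_factory=list)       # ordered TraceStep records
    final_answer: Optional[str] = None # the agent's final reported answer
```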

Results & Findings

| Model | Overall Success Rate* | Notable Strengths | Main Weaknesses |
| --- | --- | --- | --- |
| Granite 4 Small | ~58 % | Handles deterministic file reads well | Struggles with ambiguous prompts; frequent “premature action” |
| Llama 4 Maverick | ~62 % | Slightly better at uncertainty handling | Still prone to “over‑helpfulness” and context pollution |
| DeepSeek V3.1 | ~78 % | Robust recovery, fewer distractor errors | Occasionally brittle under heavy tool‑call load |

*Success = completing the task within the allowed number of tool calls and producing a correct final answer.
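
As a rough illustration of that criterion (not the paper’s actual grader), a trace could be scored like this, reusing the hypothetical `Trace` record sketched in the Methodology section and assuming a per‑task tool‑call budget and expected answer:

```python
def is_success(trace: Trace, expected_answer: str, max_tool_calls: int) -> bool:
    """Hypothetical scorer: success = staying within the tool-call budget AND
    producing a correct final answer."""
    tool_calls = sum(1 for step in trace.steps if step.tool_name is not None)
    if tool_calls > max_tool_calls or trace.final_answer is None:
        return False
    # Naive exact-match check; numbers, whitespace, and formatting differences
    # would need task-specific normalization in practice.
    return trace.final_answer.strip() == expected_answer.strip()
```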

Four failure archetypes uncovered

  1. Premature Action Without Grounding – The model issues a tool call before it has verified the necessary context (e.g., querying a DB before confirming the table name exists); a detection sketch follows this list.
  2. Over‑Helpfulness – The agent fabricates missing entities (e.g., inventing a column name) to keep the conversation flowing, leading to silent logical errors.
  3. Distractor‑Induced Context Pollution – Irrelevant information in the prompt or previous steps contaminates the model’s reasoning, causing it to chase dead‑ends.
  4. Fragile Execution Under Load – When the number of required tool calls exceeds a modest threshold, the model’s internal state degrades, resulting in dropped calls or malformed commands.
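
The paper’s archetype coding was manual, but the first pattern lends itself to a simple heuristic check over a trace. Below is a sketch, reusing the hypothetical `Trace` record above; the tool names (`run_sql`, `list_tables`, `describe_table`) are placeholders for whatever the harness actually exposes:

```python
def flags_premature_sql(trace: Trace) -> bool:
    """Heuristic for archetype 1: True if the agent ran a SQL query before ever
    inspecting the schema. Tool names are hypothetical placeholders."""
    seen_schema_check = False
    for step in trace.steps:
        if step.tool_name in ("list_tables", "describe_table"):
            seen_schema_check = True
        elif step.tool_name == "run_sql" and not seen_schema_check:
            return True
    return False
```

Similar heuristics could approximate the other archetypes, though over‑helpfulness and context pollution are harder to flag without semantic checks on the tool arguments and results.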

In short, model size did not guarantee resilience: DeepSeek’s RL‑based post‑training gave it a decisive edge, suggesting that targeted fine‑tuning matters more than raw parameter count for agentic reliability.

Practical Implications

  • Enterprise AI Assistants – Companies building internal bots (e.g., for data retrieval or report generation) should prioritize reinforcement‑learning fine‑tuning and explicit verification steps over simply scaling up the base model.
  • Tool‑Use SDKs – SDK designers can embed “guardrails” (e.g., schema validation before a SQL call) to catch premature actions early, reducing the impact of the first failure archetype; see the guardrail sketch after this list.
  • Prompt Engineering – Structuring prompts to isolate relevant context and explicitly request confirmation before tool calls can mitigate over‑helpfulness and distractor pollution.
  • Monitoring & Recovery – Deployments should log full execution traces (as KAMI does) and implement automated rollback or retry mechanisms when a trace shows signs of the “fragile execution” pattern.
  • Benchmarking Standards – The community can adopt KAMI‑style trace‑level evaluation to surface hidden bugs before shipping LLM‑powered agents to production.
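
To make the “guardrails” bullet concrete, here is a minimal sketch of schema validation before a SQL call, written against SQLite rather than any particular SDK; the function name and the regex‑based table extraction are illustrative simplifications:

```python
import re
import sqlite3

def run_sql_with_guardrail(conn: sqlite3.Connection, query: str):
    """Refuse to execute model-generated SQL that references tables missing
    from the live schema, catching 'premature action' before it hits the DB."""
    existing = {row[0].lower() for row in
                conn.execute("SELECT name FROM sqlite_master WHERE type='table'")}
    # Crude extraction of table names after FROM/JOIN; enough for a sketch,
    # not a substitute for a real SQL parser.
    referenced = {name.lower() for name in
                  re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)",
                             query, re.IGNORECASE)}
    missing = referenced - existing
    if missing:
        raise ValueError(f"Query references unknown tables: {sorted(missing)}")
    return conn.execute(query).fetchall()
```

In a deployment, the agent’s tool layer would route model‑generated SQL through a wrapper like this, turning a premature or fabricated query into an explicit, retryable error rather than a silent logical failure.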

Limitations & Future Work

  • Synthetic Environment – The benchmark runs in a controlled simulation; real‑world systems may introduce network latency, permission errors, or richer data modalities that were not tested.
  • Model Diversity – Only three models were examined; extending the study to newer open‑source and closed‑source LLMs (e.g., GPT‑4o, Claude) would validate whether the identified archetypes generalize.
  • Automated Failure Classification – Current analysis relies on manual trace inspection; future work could train a meta‑model to automatically flag the four failure types at scale.
  • User‑In‑the‑Loop Scenarios – The study assumes fully autonomous agents; incorporating intermittent human feedback could reveal additional robustness strategies.

By shining a light on how LLMs stumble rather than just how well they score, this research offers a roadmap for building truly dependable AI agents that can be trusted in everyday developer workflows and enterprise pipelines.

Authors

  • JV Roig

Paper Information

  • arXiv ID: 2512.07497v1
  • Categories: cs.AI, cs.SE
  • Published: December 8, 2025
  • PDF: Download PDF