[Paper] ProbeLLM: Automating Principled Diagnosis of LLM Failures

Published: February 13, 2026 at 09:33 AM EST
4 min read
Source: arXiv - 2602.12966v1

Overview

Large language models (LLMs) are getting bigger and more capable, but they still stumble in surprising ways. ProbeLLM introduces a systematic, budget‑aware way to automatically uncover not just isolated bugs but whole families of weaknesses—what the authors call structured failure modes. By treating probing as a hierarchical search problem, the framework promises a clearer, more actionable picture of where LLMs need improvement.

Key Contributions

  • Hierarchical Monte Carlo Tree Search (MCTS) for probing – balances global exploration of new error regions with local refinement of recurring patterns.
  • Benchmark‑agnostic design – works across any downstream task without hand‑crafted test suites.
  • Tool‑augmented generation & verification – only retains test cases that can be automatically verified, ensuring reliable failure evidence.
  • Failure‑aware embeddings + boundary‑aware induction – clusters raw failures into human‑readable “failure modes” with clear decision boundaries.
  • Empirical validation – demonstrates broader, cleaner, and finer‑grained failure landscapes on multiple LLMs (e.g., GPT‑3.5, LLaMA‑2) compared with static benchmarks and prior probing methods.

Methodology

  1. Problem framing – Probing is cast as a hierarchical MCTS. The root node represents the whole input space; each child node corresponds to a more specific sub‑region (e.g., a particular prompt pattern).
  2. Budget allocation – The algorithm receives a fixed probing budget (e.g., 10 k generated prompts). At each step it decides whether to explore a new region (global) or exploit a promising region (local).
  3. Prompt generation – Uses LLM‑driven generation augmented with external tools (e.g., calculators, knowledge bases) to create candidate test cases.
  4. Verification – Each generated case is run through a verifier (rule‑based or tool‑backed) that checks whether the LLM’s output violates a known constraint (e.g., factual inconsistency, logical contradiction). Only verified failures are kept.
  5. Embedding & clustering – Failed cases are encoded with a failure‑aware embedding that captures both the prompt and the nature of the error. A boundary‑aware induction algorithm then groups them into interpretable clusters, each representing a distinct failure mode.
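The verification step (4) is the part that keeps the pipeline trustworthy: a case only counts as a failure if an external tool contradicts the model. A minimal sketch of that idea, using exact arithmetic as the tool-backed verifier (the names `check_arithmetic` and `VerifiedFailure` are illustrative, not from the paper):

```python
# Sketch of a rule-based, tool-backed verifier: retain a generated test case
# only when a trusted tool (here, exact Python arithmetic) contradicts the
# model's output. Illustrative only; not the paper's actual verifier toolkit.
from dataclasses import dataclass
from typing import Optional

@dataclass
class VerifiedFailure:
    prompt: str
    model_output: str
    expected: str

def check_arithmetic(prompt: str, model_output: str, a: int, b: int) -> Optional[VerifiedFailure]:
    """Verify an 'a + b = ?' prompt against ground truth computed by a tool."""
    expected = str(a + b)
    if model_output.strip() == expected:
        return None  # model answered correctly; the case is discarded
    return VerifiedFailure(prompt, model_output, expected)

# Only cases with hard evidence of failure are kept:
fail = check_arithmetic("What is 17 + 25?", "41", 17, 25)  # retained
ok = check_arithmetic("What is 17 + 25?", "42", 17, 25)    # discarded
```

Because every retained case carries the expected answer alongside the model's wrong one, downstream clustering operates on verified evidence rather than guesses.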

The whole pipeline runs automatically, requiring only the LLM under test, a verification toolkit, and a budget specification.
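The explore/exploit trade-off in steps 1–2 can be sketched with a flat UCB1 bandit standing in for the paper's hierarchical MCTS. Everything here is invented for illustration: the region names, the simulated failure rates, and the budget are assumptions, not values from the paper.

```python
# Hedged sketch of budget-aware probing: UCB1-style selection over input-space
# regions (a simplification of the paper's hierarchical MCTS). The regions and
# their simulated failure probabilities are hypothetical.
import math
import random

random.seed(0)
regions = {"negation": 0.6, "dates": 0.3, "code": 0.1}  # assumed failure rates
counts = {r: 0 for r in regions}
failures = {r: 0 for r in regions}

def ucb_score(region: str, total: int) -> float:
    if counts[region] == 0:
        return float("inf")  # probe every region at least once (global exploration)
    exploit = failures[region] / counts[region]              # local refinement signal
    explore = math.sqrt(2 * math.log(total) / counts[region])  # exploration bonus
    return exploit + explore

budget = 300  # fixed probing budget, e.g. number of generated prompts
for t in range(1, budget + 1):
    region = max(regions, key=lambda r: ucb_score(r, t))
    counts[region] += 1
    if random.random() < regions[region]:  # simulated probe of the LLM
        failures[region] += 1
```

Under this policy the probing budget concentrates on the region that keeps yielding verified failures, while the exploration bonus guarantees low-yield regions are still sampled occasionally.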

Results & Findings

| Model / Benchmark | # Failure Modes (ProbeLLM) | # Failure Modes (Static Suite) | Avg. Precision of Discovered Failures |
| --- | --- | --- | --- |
| GPT‑3.5 (QA) | 27 | 9 | 0.94 |
| LLaMA‑2‑13B (Summ.) | 31 | 12 | 0.91 |
| GPT‑4 (Code) | 22 | 8 | 0.96 |
  • Broader coverage – ProbeLLM finds roughly 2–3× more distinct failure modes than traditional static benchmarks.
  • Cleaner signals – Because every failure is verified, the false‑positive rate drops below 5 %, compared to >15 % in prior automated probing.
  • Fine‑grained insights – The induced clusters expose subtle patterns (e.g., “mis‑interpreting negation in multi‑step reasoning” or “hallucinating dates when asked for historical timelines”).

Overall, the study shows that a principled exploration strategy yields a richer, more trustworthy map of LLM weaknesses.

Practical Implications

  • Targeted model debugging – Engineers can prioritize fixing entire failure modes rather than chasing isolated bugs, accelerating iteration cycles.
  • Continuous evaluation pipelines – ProbeLLM’s budget‑controlled, automated nature makes it suitable for CI/CD setups that continuously monitor new model releases.
  • Safety & compliance – By grounding failures in verifiable constraints (e.g., legal or medical guidelines), organizations can generate audit‑ready evidence of model limitations.
  • Prompt engineering – The discovered failure modes often point to systematic prompt patterns that should be avoided or re‑designed, informing better user‑facing APIs.
  • Benchmark design – The framework can be used to augment existing test suites, ensuring they stay relevant as models evolve faster than static datasets.
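For the CI/CD use case, a budget-controlled probing run can act as a release gate. A minimal sketch, assuming a hypothetical probing report and team-chosen thresholds (none of these names or numbers come from the paper):

```python
# Sketch of a CI release gate driven by a ProbeLLM-style probing report.
# The thresholds and the report dict are hypothetical placeholders.

MAX_FAILURE_MODES = 25  # policy set by the team, not by the framework
MIN_PRECISION = 0.90    # require cleanly verified (low false-positive) evidence

def release_gate(num_failure_modes: int, precision: float) -> bool:
    """Return True when the candidate model release may ship."""
    return num_failure_modes <= MAX_FAILURE_MODES and precision >= MIN_PRECISION

# Example: a hypothetical probing report for a candidate release
report = {"num_failure_modes": 22, "precision": 0.96}
ship = release_gate(report["num_failure_modes"], report["precision"])
```

Because the probing budget is fixed, such a gate has a predictable cost per release, which is what makes it practical to run on every model update.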

Limitations & Future Work

  • Verification dependency – The quality of discovered failures hinges on the availability of reliable, tool‑augmented verifiers; domains lacking such tools may see reduced coverage.
  • Budget sensitivity – While the MCTS allocation is principled, very tight budgets can bias the search toward easy‑to‑find failures, potentially missing rare but critical modes.
  • Interpretability of clusters – The induced failure modes are human‑readable but may still require domain experts to label and act upon them.
  • Future directions – The authors suggest integrating reinforcement learning to adapt the budget dynamically, expanding tool suites for richer verification, and applying the approach to multimodal models (e.g., vision‑language).

ProbeLLM marks a step toward turning LLM evaluation from a static “checklist” into an active, discovery‑driven process—something that developers, product teams, and safety engineers can start leveraging right away.

Authors

  • Yue Huang
  • Zhengzhe Jiang
  • Yuchen Ma
  • Yu Jiang
  • Xiangqi Wang
  • Yujun Zhou
  • Yuexing Hao
  • Kehan Guo
  • Pin‑Yu Chen
  • Stefan Feuerriegel
  • Xiangliang Zhang

Paper Information

  • arXiv ID: 2602.12966v1
  • Categories: cs.CL, cs.SE
  • Published: February 13, 2026