[Paper] ProbeLLM: Automating Principled Diagnosis of LLM Failures
Source: arXiv - 2602.12966v1
Overview
Large language models (LLMs) are getting bigger and more capable, but they still stumble in surprising ways. ProbeLLM introduces a systematic, budget‑aware way to automatically uncover not just isolated bugs but whole families of weaknesses—what the authors call structured failure modes. By treating probing as a hierarchical search problem, the framework promises a clearer, more actionable picture of where LLMs need improvement.
Key Contributions
- Hierarchical Monte Carlo Tree Search (MCTS) for probing – balances global exploration of new error regions with local refinement of recurring patterns.
- Benchmark‑agnostic design – works across any downstream task without hand‑crafted test suites.
- Tool‑augmented generation & verification – only retains test cases that can be automatically verified, ensuring reliable failure evidence.
- Failure‑aware embeddings + boundary‑aware induction – clusters raw failures into human‑readable “failure modes” with clear decision boundaries.
- Empirical validation – demonstrates broader, cleaner, and finer‑grained failure landscapes on multiple LLMs (e.g., GPT‑3.5, LLaMA‑2) compared with static benchmarks and prior probing methods.
Methodology
- Problem framing – Probing is cast as a hierarchical MCTS. The root node represents the whole input space; each child node corresponds to a more specific sub‑region (e.g., a particular prompt pattern).
- Budget allocation – The algorithm receives a fixed probing budget (e.g., 10,000 generated prompts). At each step it decides whether to explore a new region (global) or exploit a promising region (local).
- Prompt generation – Uses LLM‑driven generation augmented with external tools (e.g., calculators, knowledge bases) to create candidate test cases.
- Verification – Each generated case is run through a verifier (rule‑based or tool‑backed) that checks whether the LLM’s output violates a known constraint (e.g., factual inconsistency, logical contradiction). Only verified failures are kept.
- Embedding & clustering – Failed cases are encoded with a failure‑aware embedding that captures both the prompt and the nature of the error. A boundary‑aware induction algorithm then groups them into interpretable clusters, each representing a distinct failure mode.
The whole pipeline runs automatically, requiring only the LLM under test, a verification toolkit, and a budget specification.
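The budget-allocated tree search described above can be sketched in a few dozen lines. This is an illustrative toy, not the authors' implementation: the node structure, the UCB1 scoring, and the `generate`/`verify` callables are all stand-ins assumed for the sketch (a real system would plug in LLM-driven prompt generation and tool-backed verifiers, and would also expand promising leaves into finer sub-regions).

```python
import math

class RegionNode:
    """A node in the probing tree: the root covers the whole input space,
    children cover progressively more specific sub-regions
    (e.g., a particular prompt pattern)."""
    def __init__(self, pattern, parent=None):
        self.pattern = pattern
        self.parent = parent
        self.children = []
        self.visits = 0
        self.failures = 0  # verified failures found in this sub-tree

    def ucb(self, c=1.4):
        """UCB1 score: failure rate (exploitation) plus an exploration
        bonus that favors rarely visited regions."""
        if self.visits == 0:
            return float("inf")
        exploit = self.failures / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def probe(root, generate, verify, budget):
    """Spend `budget` generated prompts. Each step descends the tree by
    UCB score (global exploration vs. local refinement), generates one
    candidate in the chosen region, and keeps it only if the verifier
    confirms a failure."""
    kept = []
    for _ in range(budget):
        node = root
        node.visits += 1
        while node.children:  # descend to the most promising region
            node = max(node.children, key=lambda n: n.ucb())
            node.visits += 1
        prompt = generate(node.pattern)
        if verify(prompt):  # retain only automatically verifiable failures
            kept.append((node.pattern, prompt))
            n = node
            while n is not None:  # back-propagate the failure signal
                n.failures += 1
                n = n.parent
    return kept
```

With a verifier that consistently flags one prompt pattern, the search concentrates its remaining budget on that region while still occasionally revisiting the others, which is the global/local balance the paper describes.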
Results & Findings
| Model / Benchmark | # Failure Modes (ProbeLLM) | # Failure Modes (Static Suite) | Avg. Precision of Discovered Failures |
|---|---|---|---|
| GPT‑3.5 (QA) | 27 | 9 | 0.94 |
| LLaMA‑2‑13B (Summ.) | 31 | 12 | 0.91 |
| GPT‑4 (Code) | 22 | 8 | 0.96 |
- Broader coverage – ProbeLLM finds roughly 2–3× more distinct failure modes than traditional static benchmarks.
- Cleaner signals – Because every failure is verified, the false‑positive rate drops below 5%, compared with >15% in prior automated probing.
- Fine‑grained insights – The induced clusters expose subtle patterns (e.g., “mis‑interpreting negation in multi‑step reasoning” or “hallucinating dates when asked for historical timelines”).
Overall, the study shows that a principled exploration strategy yields a richer, more trustworthy map of LLM weaknesses.
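The grouping step behind these fine‑grained insights can be approximated with a toy version of failure‑aware clustering. Everything here is an illustrative stand‑in: the bag‑of‑words embedding with an up‑weighted error tag is a crude proxy for the paper's learned failure‑aware embedding, and the greedy threshold clustering is a simple substitute for its boundary‑aware induction.

```python
import math
from collections import Counter

def embed(case):
    """Toy 'failure-aware' embedding: bag-of-words over the prompt plus
    an up-weighted tag for the error type, so cases that fail for the
    same reason land near each other."""
    vec = Counter(case["prompt"].lower().split())
    vec["ERR::" + case["error"]] += 3  # emphasize the error signal
    return vec

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def cluster(cases, threshold=0.35):
    """Greedy single-pass clustering: attach each case to the first
    cluster whose centroid is similar enough, else start a new one."""
    clusters = []  # each: {"centroid": Counter, "members": [case, ...]}
    for case in cases:
        v = embed(case)
        for cl in clusters:
            if cosine(v, cl["centroid"]) >= threshold:
                cl["members"].append(case)
                cl["centroid"].update(v)  # running sum as centroid
                break
        else:
            clusters.append({"centroid": Counter(v), "members": [case]})
    return clusters
```

Fed a mix of negation failures and hallucinated‑date failures, this sketch separates them into two groups, mirroring how the induced clusters surface patterns like "mis‑interpreting negation" versus "hallucinating dates".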
Practical Implications
- Targeted model debugging – Engineers can prioritize fixing entire failure modes rather than chasing isolated bugs, accelerating iteration cycles.
- Continuous evaluation pipelines – ProbeLLM’s budget‑controlled, automated nature makes it suitable for CI/CD setups that continuously monitor new model releases.
- Safety & compliance – By grounding failures in verifiable constraints (e.g., legal or medical guidelines), organizations can generate audit‑ready evidence of model limitations.
- Prompt engineering – The discovered failure modes often point to systematic prompt patterns that should be avoided or re‑designed, informing better user‑facing APIs.
- Benchmark design – The framework can be used to augment existing test suites, ensuring they stay relevant as models evolve faster than static datasets.
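As a concrete sketch of the CI/CD use case, a release gate could consume ProbeLLM's discovered failure modes and block a deployment when any mode exceeds a rate budget. The schema below (`name`/`rate`/`tags` dictionaries, the `blocked_tags` convention) is entirely hypothetical, invented for illustration.

```python
def gate(failure_modes, max_rate=0.05, blocked_tags=("safety",)):
    """Hypothetical CI gate over discovered failure modes. Each entry is
    assumed to look like {"name": str, "rate": float, "tags": [str]}
    (a made-up schema). Returns the names of violating modes; an empty
    list means the release may proceed."""
    violations = []
    for fm in failure_modes:
        too_frequent = fm["rate"] > max_rate
        blocked = any(t in blocked_tags for t in fm.get("tags", []))
        if too_frequent or blocked:
            violations.append(fm["name"])
    return violations
```

In practice such a gate would run after a budget‑bounded probing pass on each candidate model, turning the failure map into a pass/fail signal the pipeline can act on.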
Limitations & Future Work
- Verification dependency – The quality of discovered failures hinges on the availability of reliable, tool‑augmented verifiers; domains lacking such tools may see reduced coverage.
- Budget sensitivity – While the MCTS allocation is principled, very tight budgets can bias the search toward easy‑to‑find failures, potentially missing rare but critical modes.
- Interpretability of clusters – The induced failure modes are human‑readable but may still require domain experts to label and act upon them.
- Future directions – The authors suggest integrating reinforcement learning to adapt the budget dynamically, expanding tool suites for richer verification, and applying the approach to multimodal models (e.g., vision‑language).
ProbeLLM marks a step toward turning LLM evaluation from a static “checklist” into an active, discovery‑driven process—something that developers, product teams, and safety engineers can start leveraging right away.
Authors
- Yue Huang
- Zhengzhe Jiang
- Yuchen Ma
- Yu Jiang
- Xiangqi Wang
- Yujun Zhou
- Yuexing Hao
- Kehan Guo
- Pin‑Yu Chen
- Stefan Feuerriegel
- Xiangliang Zhang
Paper Information
- arXiv ID: 2602.12966v1
- Categories: cs.CL, cs.SE
- Published: February 13, 2026