[Paper] ProbeLLM: Automating Principled Diagnosis of LLM Failures
Source: arXiv - 2602.12966v1
Overview
Large language models (LLMs) are getting bigger and more capable, but they still stumble in surprising ways. ProbeLLM introduces a systematic, budget‑aware way to automatically uncover not just isolated bugs but whole families of weaknesses—what the authors call structured failure modes. By treating probing as a hierarchical search problem, the framework promises a clearer, more actionable picture of where LLMs need improvement.
Key Contributions
- Hierarchical Monte Carlo Tree Search (MCTS) for probing – balances global exploration of new error regions with local refinement of recurring patterns.
- Benchmark‑agnostic design – works across any downstream task without hand‑crafted test suites.
- Tool‑augmented generation & verification – only retains test cases that can be automatically verified, ensuring reliable failure evidence.
- Failure‑aware embeddings + boundary‑aware induction – clusters raw failures into human‑readable “failure modes” with clear decision boundaries.
- Empirical validation – demonstrates broader, cleaner, and finer‑grained failure landscapes on multiple LLMs (e.g., GPT‑3.5, LLaMA‑2) compared with static benchmarks and prior probing methods.
Methodology
- Problem framing – Probing is cast as a hierarchical MCTS. The root node represents the whole input space; each child node corresponds to a more specific sub‑region (e.g., a particular prompt pattern).
- Budget allocation – The algorithm receives a fixed probing budget (e.g., 10,000 generated prompts). At each step it decides whether to explore a new region (global) or exploit a promising region (local).
- Prompt generation – Uses LLM‑driven generation augmented with external tools (e.g., calculators, knowledge bases) to create candidate test cases.
- Verification – Each generated case is run through a verifier (rule‑based or tool‑backed) that checks whether the LLM’s output violates a known constraint (e.g., factual inconsistency, logical contradiction). Only verified failures are kept.
- Embedding & clustering – Failed cases are encoded with a failure‑aware embedding that captures both the prompt and the nature of the error. A boundary‑aware induction algorithm then groups them into interpretable clusters, each representing a distinct failure mode.
The whole pipeline runs automatically, requiring only the LLM under test, a verification toolkit, and a budget specification.
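The budget-allocated tree search described above can be sketched in a few dozen lines. This is an illustrative toy, not the authors' implementation: the node structure, the UCB1 scoring, and the `generate`/`verify` callables are all stand-ins assumed for the sketch (a real system would plug in LLM-driven prompt generation and tool-backed verifiers, and would also expand promising leaves into finer sub-regions).

```python
import math

class RegionNode:
    """A node in the probing tree: the root covers the whole input space,
    children cover progressively more specific sub-regions
    (e.g., a particular prompt pattern)."""
    def __init__(self, pattern, parent=None):
        self.pattern = pattern
        self.parent = parent
        self.children = []
        self.visits = 0
        self.failures = 0  # verified failures found in this sub-tree

    def ucb(self, c=1.4):
        """UCB1 score: failure rate (exploitation) plus an exploration
        bonus that favors rarely visited regions."""
        if self.visits == 0:
            return float("inf")
        exploit = self.failures / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def probe(root, generate, verify, budget):
    """Spend `budget` generated prompts. Each step descends the tree by
    UCB score (global exploration vs. local refinement), generates one
    candidate in the chosen region, and keeps it only if the verifier
    confirms a failure."""
    kept = []
    for _ in range(budget):
        node = root
        node.visits += 1
        while node.children:  # descend to the most promising region
            node = max(node.children, key=lambda n: n.ucb())
            node.visits += 1
        prompt = generate(node.pattern)
        if verify(prompt):  # retain only automatically verifiable failures
            kept.append((node.pattern, prompt))
            n = node
            while n is not None:  # back-propagate the failure signal
                n.failures += 1
                n = n.parent
    return kept
```

With a verifier that consistently flags one prompt pattern, the search concentrates its remaining budget on that region while still occasionally revisiting the others, which is the global/local balance the paper describes.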
Results & Findings
| Model / Benchmark | # Failure Modes (ProbeLLM) | # Failure Modes (Static Suite) | Avg. Precision of Discovered Failures |
|---|---|---|---|
| GPT‑3.5 (QA) | 27 | 9 | 0.94 |
| LLaMA‑2‑13B (Summ.) | 31 | 12 | 0.91 |
| GPT‑4 (Code) | 22 | 8 | 0.96 |
- Broader coverage – ProbeLLM finds roughly 2–3× more distinct failure modes than traditional static benchmarks.
- Cleaner signals – Because every failure is verified, the false‑positive rate drops below 5%, compared with >15% in prior automated probing.
- Fine‑grained insights – The induced clusters expose subtle patterns (e.g., “mis‑interpreting negation in multi‑step reasoning” or “hallucinating dates when asked for historical timelines”).
Overall, the study shows that a principled exploration strategy yields a richer, more trustworthy map of LLM weaknesses.
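The grouping step behind these fine‑grained insights can be approximated with a toy version of failure‑aware clustering. Everything here is an illustrative stand‑in: the bag‑of‑words embedding with an up‑weighted error tag is a crude proxy for the paper's learned failure‑aware embedding, and the greedy threshold clustering is a simple substitute for its boundary‑aware induction.

```python
import math
from collections import Counter

def embed(case):
    """Toy 'failure-aware' embedding: bag-of-words over the prompt plus
    an up-weighted tag for the error type, so cases that fail for the
    same reason land near each other."""
    vec = Counter(case["prompt"].lower().split())
    vec["ERR::" + case["error"]] += 3  # emphasize the error signal
    return vec

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def cluster(cases, threshold=0.35):
    """Greedy single-pass clustering: attach each case to the first
    cluster whose centroid is similar enough, else start a new one."""
    clusters = []  # each: {"centroid": Counter, "members": [case, ...]}
    for case in cases:
        v = embed(case)
        for cl in clusters:
            if cosine(v, cl["centroid"]) >= threshold:
                cl["members"].append(case)
                cl["centroid"].update(v)  # running sum as centroid
                break
        else:
            clusters.append({"centroid": Counter(v), "members": [case]})
    return clusters
```

Fed a mix of negation failures and hallucinated‑date failures, this sketch separates them into two groups, mirroring how the induced clusters surface patterns like "mis‑interpreting negation" versus "hallucinating dates".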
Practical Implications
- Targeted model debugging – Engineers can prioritize fixing entire failure modes rather than chasing isolated bugs, accelerating iteration cycles.
- Continuous evaluation pipelines – ProbeLLM’s budget‑controlled, automated nature makes it suitable for CI/CD setups that continuously monitor new model releases.
- Safety & compliance – By grounding failures in verifiable constraints (e.g., legal or medical guidelines), organizations can generate audit‑ready evidence of model limitations.
- Prompt engineering – The discovered failure modes often point to systematic prompt patterns that should be avoided or re‑designed, informing better user‑facing APIs.
- Benchmark design – The framework can be used to augment existing test suites, ensuring they stay relevant as models evolve faster than static datasets.
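As a concrete sketch of the CI/CD use case, a release gate could consume ProbeLLM's discovered failure modes and block a deployment when any mode exceeds a rate budget. The schema below (`name`/`rate`/`tags` dictionaries, the `blocked_tags` convention) is entirely hypothetical, invented for illustration.

```python
def gate(failure_modes, max_rate=0.05, blocked_tags=("safety",)):
    """Hypothetical CI gate over discovered failure modes. Each entry is
    assumed to look like {"name": str, "rate": float, "tags": [str]}
    (a made-up schema). Returns the names of violating modes; an empty
    list means the release may proceed."""
    violations = []
    for fm in failure_modes:
        too_frequent = fm["rate"] > max_rate
        blocked = any(t in blocked_tags for t in fm.get("tags", []))
        if too_frequent or blocked:
            violations.append(fm["name"])
    return violations
```

In practice such a gate would run after a budget‑bounded probing pass on each candidate model, turning the failure map into a pass/fail signal the pipeline can act on.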
Limitations & Future Work
- Verification dependency – The quality of discovered failures hinges on the availability of reliable, tool‑augmented verifiers; domains lacking such tools may see reduced coverage.
- Budget sensitivity – While the MCTS allocation is principled, very tight budgets can bias the search toward easy‑to‑find failures, potentially missing rare but critical modes.
- Interpretability of clusters – The induced failure modes are human‑readable but may still require domain experts to label and act upon them.
- Future directions – The authors suggest integrating reinforcement learning to adapt the budget dynamically, expanding tool suites for richer verification, and applying the approach to multimodal models (e.g., vision‑language).
ProbeLLM marks a step toward turning LLM evaluation from a static “checklist” into an active, discovery‑driven process—something that developers, product teams, and safety engineers can start leveraging right away.
Authors
- Yue Huang
- Zhengzhe Jiang
- Yuchen Ma
- Yu Jiang
- Xiangqi Wang
- Yujun Zhou
- Yuexing Hao
- Kehan Guo
- Pin‑Yu Chen
- Stefan Feuerriegel
- Xiangliang Zhang
Paper Information
- arXiv ID: 2602.12966v1
- Categories: cs.CL, cs.SE
- Published: February 13, 2026