[Paper] Decomposed Prompting Does Not Fix Knowledge Gaps, But Helps Models Say 'I Don't Know'
Source: arXiv - 2602.04853v1
Overview
Large language models (LLMs) are great at answering factual questions, but they often pretend they know the answer when they really don’t, producing confident hallucinations. The paper Decomposed Prompting Does Not Fix Knowledge Gaps, But Helps Models Say “I Don’t Know” investigates whether breaking a question into smaller steps (decomposed prompting) can make LLMs more reliable, and discovers a simple way to let the model abstain when it’s unsure.
Key Contributions
- Three prompting regimes compared:
  - Direct – ask the model to answer in one shot.
  - Assistive – provide an external “helper” prompt that supplies a hint.
  - Incremental – decompose the question into sub‑questions and combine the answers.
- Cross‑regime disagreement as a reliability signal: When the three regimes disagree, the answer is far more likely to be wrong.
- Training‑free abstention policy: By refusing to answer whenever the regimes disagree, the system dramatically reduces hallucinations without any extra retrieval, fine‑tuning, or extra model parameters.
- Extensive evaluation: Experiments on multiple multi‑hop QA benchmarks (e.g., HotpotQA, ComplexWebQuestions) and model sizes (from 2.7 B to 175 B parameters) show the method works across the board.
- Benchmarking against standard uncertainty baselines: Disagreement‑based abstention outperforms entropy‑based and confidence‑score baselines in both F1 and AUROC.
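The three regimes can be sketched as prompt templates. The templates below are illustrative stand-ins, not the paper's exact wording; `REGIME_TEMPLATES` and `build_prompt` are hypothetical names:

```python
# Illustrative templates for the three prompting regimes.
# The exact prompts used in the paper may differ; these are
# hypothetical stand-ins that preserve the structural idea.

REGIME_TEMPLATES = {
    # Direct: one-shot question answering.
    "direct": "Answer the following question concisely.\nQ: {question}\nA:",
    # Assistive: a helper hint accompanies the question.
    "assistive": (
        "Hint: {hint}\n"
        "Using the hint if helpful, answer the question.\n"
        "Q: {question}\nA:"
    ),
    # Incremental: decompose into sub-questions, then combine.
    "incremental": (
        "Break the question into sub-questions, answer each, "
        "then give the final answer.\nQ: {question}\n"
        "Sub-questions and final answer:"
    ),
}

def build_prompt(regime: str, question: str, hint: str = "") -> str:
    """Fill the template for one regime with a concrete question."""
    return REGIME_TEMPLATES[regime].format(question=question, hint=hint)
```

Keeping the templates functionally equivalent (same task, different presentation) is what makes disagreement attributable to the model rather than to the prompts.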
Methodology
- Prompt Design – The authors craft three functionally equivalent prompts that differ only in how the question is presented to the model.
- Inference Pipeline – For each input question, the model is run three times (once per regime) and the three textual answers are collected.
- Agreement Check – If all three answers are identical (or map to the same normalized answer), the system outputs that answer. If they differ, the model abstains (returns “I don’t know”).
- Evaluation Metrics – Standard QA metrics (Exact Match, F1) are measured on the answered subset, while abstention quality is measured with AUROC and calibration curves.
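Exact Match and F1 on the answered subset follow the standard SQuAD-style recipe: normalize both strings, then compare exactly (EM) or by token overlap (F1). A minimal sketch; the paper's exact normalization rules may differ:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```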
- Baselines – The authors compare against:
  - Softmax confidence (max token probability).
  - Entropy of the output distribution.
  - Monte‑Carlo dropout (sampling‑based uncertainty).
The whole pipeline requires no additional training, retrieval, or external knowledge source—just multiple forward passes with different prompts.
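The agreement-gated pipeline can be sketched in a few lines; `run_regime` is a hypothetical placeholder for one forward pass of any LLM API under a given prompt regime:

```python
from typing import Callable

ABSTAIN = "I don't know"

def answer_or_abstain(
    question: str,
    run_regime: Callable[[str, str], str],  # (regime, question) -> raw answer
    normalize: Callable[[str], str] = lambda s: " ".join(s.lower().split()),
) -> str:
    """Run the three prompting regimes and answer only on full agreement.

    `run_regime` stands in for one LLM forward pass under a given
    prompt regime; any API client can be plugged in. The default
    `normalize` is a simplified stand-in for answer normalization.
    """
    regimes = ("direct", "assistive", "incremental")
    answers = [run_regime(r, question) for r in regimes]
    normalized = {normalize(a) for a in answers}
    # Unanimous (after normalization) -> emit the answer; otherwise abstain.
    return answers[0] if len(normalized) == 1 else ABSTAIN
```

Because the gate only compares strings, it adds no model calls beyond the three regime passes themselves.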
Results & Findings
| Model (size) | Baseline F1 (no abstention) | F1 after disagreement‑abstention | AUROC (error detection) |
|---|---|---|---|
| LLaMA‑2 7B | 62.4 % | 71.8 % (≈ 9‑point gain) | 0.84 |
| LLaMA‑2 13B | 68.1 % | 76.3 % | 0.88 |
| GPT‑3 175B | 78.5 % | 84.2 % | 0.91 |
Key takeaways
- Accuracy gains from decomposition shrink as models get larger, confirming prior work that frontier models already internalize many reasoning steps.
- Disagreement is a strong error predictor: when any two regimes disagree, the answer is wrong > 80 % of the time, regardless of model size.
- Abstention improves overall quality: By refusing to answer on ambiguous cases, the system raises the precision of the answered set, which is valuable for safety‑critical applications.
- No extra cost beyond extra forward passes: The method is computationally cheap compared to retrieval‑augmented pipelines.
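The AUROC numbers above treat disagreement as an error-detection score. Since that signal is binary (1 = regimes disagree), its AUROC can be computed rank-based, with ties counted as one half. A minimal sketch with synthetic inputs, not the paper's data:

```python
def binary_auroc(scores, is_error):
    """AUROC of an error-detection score: the probability that a
    wrongly answered example scores higher than a correctly answered
    one, counting ties as 0.5. Works for a binary disagreement flag."""
    errs = [s for s, e in zip(scores, is_error) if e]
    oks = [s for s, e in zip(scores, is_error) if not e]
    wins = sum((e > o) + 0.5 * (e == o) for e in errs for o in oks)
    return wins / (len(errs) * len(oks))
```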
Practical Implications
- Safety‑first QA services – Companies can wrap any closed‑book LLM with a lightweight “confidence guardrail” that simply runs three prompts and drops answers when they don’t line up. This reduces the risk of delivering misinformation to end‑users.
- Cost‑effective reliability – Since no fine‑tuning or external knowledge bases are required, the technique can be deployed on existing APIs (e.g., OpenAI, Anthropic) with minimal engineering effort.
- Debugging tool for developers – The disagreement pattern can highlight topics where the model’s knowledge is shaky, guiding data collection or prompting strategies.
- Composable pipelines – The approach can be combined with retrieval‑augmented generation (RAG): first try the disagreement‑abstention; if the model abstains, fall back to a retrieval step. This yields a hybrid system that only pays for expensive retrieval when necessary.
- Regulatory compliance – In domains like healthcare or finance, being able to say “I don’t know” is often a legal requirement; this method offers a straightforward way to meet that demand.
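The hybrid RAG composition mentioned above is a few lines of glue code; `cheap_answer` and `rag_answer` are hypothetical callables standing in for the abstention pipeline and a retrieval-augmented system:

```python
ABSTAIN = "I don't know"

def hybrid_answer(question, cheap_answer, rag_answer):
    """Try the disagreement-gated closed-book pipeline first; fall back
    to (more expensive) retrieval-augmented generation only when the
    cheap pipeline abstains. Returns (answer, route) for logging."""
    ans = cheap_answer(question)
    if ans != ABSTAIN:
        return ans, "closed_book"
    return rag_answer(question), "rag_fallback"
```

This way the retrieval cost is paid only on the abstained fraction of traffic, and the route tag makes the fallback rate easy to monitor.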
Limitations & Future Work
- Increased latency – Running three forward passes triples inference time; for real‑time applications, batching or model distillation may be needed.
- Prompt sensitivity – The effectiveness hinges on the design of the three prompts; poorly chosen prompts could produce spurious disagreements.
- Binary abstention – The current policy is a hard “yes/no” decision. Future work could explore graded confidence scores or partial answer generation.
- Scope limited to multi‑hop QA – While the authors test on several benchmarks, it remains unclear how well the technique transfers to other tasks (e.g., code generation, summarization).
- Scaling to larger ensembles – Investigating whether adding more diverse prompts or models further improves reliability without prohibitive cost is an open question.
Authors
- Dhruv Madhwal
- Lyuxin David Zhang
- Dan Roth
- Tomer Wolfson
- Vivek Gupta
Paper Information
- arXiv ID: 2602.04853v1
- Categories: cs.CL
- Published: February 4, 2026