[Paper] Decomposed Prompting Does Not Fix Knowledge Gaps, But Helps Models Say 'I Don't Know'
Source: arXiv - 2602.04853v1
Overview
Large language models (LLMs) are great at answering factual questions, but they often pretend they know the answer when they really don’t, producing confident hallucinations. The paper Decomposed Prompting Does Not Fix Knowledge Gaps, But Helps Models Say “I Don’t Know” investigates whether breaking a question into smaller steps (decomposed prompting) can make LLMs more reliable, and discovers a simple way to let the model abstain when it’s unsure.
Key Contributions
- Three prompting regimes compared:
  - Direct – ask the model to answer in one shot.
  - Assistive – provide an external “helper” prompt that supplies a hint.
  - Incremental – decompose the question into sub‑questions and combine the answers.
- Cross‑regime disagreement as a reliability signal: When the three regimes disagree, the answer is far more likely to be wrong.
- Training‑free abstention policy: By refusing to answer whenever the regimes disagree, the system dramatically reduces hallucinations without any extra retrieval, fine‑tuning, or extra model parameters.
- Extensive evaluation: Experiments on multiple multi‑hop QA benchmarks (e.g., HotpotQA, ComplexWebQuestions) and model sizes (from 2.7 B to 175 B parameters) show the method works across the board.
- Benchmarking against standard uncertainty baselines: Disagreement‑based abstention outperforms entropy‑based and confidence‑score baselines in both F1 and AUROC.
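The three regimes can be sketched as prompt templates. The templates below are illustrative stand-ins, not the paper's exact wording; `REGIME_TEMPLATES` and `build_prompt` are hypothetical names:

```python
# Illustrative templates for the three prompting regimes.
# The exact prompts used in the paper may differ; these are
# hypothetical stand-ins that preserve the structural idea.

REGIME_TEMPLATES = {
    # Direct: one-shot question answering.
    "direct": "Answer the following question concisely.\nQ: {question}\nA:",
    # Assistive: a helper hint accompanies the question.
    "assistive": (
        "Hint: {hint}\n"
        "Using the hint if helpful, answer the question.\n"
        "Q: {question}\nA:"
    ),
    # Incremental: decompose into sub-questions, then combine.
    "incremental": (
        "Break the question into sub-questions, answer each, "
        "then give the final answer.\nQ: {question}\n"
        "Sub-questions and final answer:"
    ),
}

def build_prompt(regime: str, question: str, hint: str = "") -> str:
    """Fill the template for one regime with a concrete question."""
    return REGIME_TEMPLATES[regime].format(question=question, hint=hint)
```

Keeping the templates functionally equivalent (same task, different presentation) is what makes disagreement attributable to the model rather than to the prompts.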
Methodology
- Prompt Design – The authors craft three functionally equivalent prompts that differ only in how the question is presented to the model.
- Inference Pipeline – For each input question, the model is run three times (once per regime) and the three textual answers are collected.
- Agreement Check – If all three answers are identical (or map to the same normalized answer), the system outputs that answer. If they differ, the model abstains (returns “I don’t know”).
- Evaluation Metrics – Standard QA metrics (Exact Match, F1) are measured on the answered subset, while abstention quality is measured with AUROC and calibration curves.
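Exact Match and F1 on the answered subset follow the standard SQuAD-style recipe: normalize both strings, then compare exactly (EM) or by token overlap (F1). A minimal sketch; the paper's exact normalization rules may differ:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```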
- Baselines – The authors compare against:
  - Softmax confidence (max token probability).
  - Entropy of the output distribution.
  - Monte‑Carlo dropout (sampling‑based uncertainty).
The whole pipeline requires no additional training, retrieval, or external knowledge source—just multiple forward passes with different prompts.
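The agreement-gated pipeline can be sketched in a few lines; `run_regime` is a hypothetical placeholder for one forward pass of any LLM API under a given prompt regime:

```python
from typing import Callable

ABSTAIN = "I don't know"

def answer_or_abstain(
    question: str,
    run_regime: Callable[[str, str], str],  # (regime, question) -> raw answer
    normalize: Callable[[str], str] = lambda s: " ".join(s.lower().split()),
) -> str:
    """Run the three prompting regimes and answer only on full agreement.

    `run_regime` stands in for one LLM forward pass under a given
    prompt regime; any API client can be plugged in. The default
    `normalize` is a simplified stand-in for answer normalization.
    """
    regimes = ("direct", "assistive", "incremental")
    answers = [run_regime(r, question) for r in regimes]
    normalized = {normalize(a) for a in answers}
    # Unanimous (after normalization) -> emit the answer; otherwise abstain.
    return answers[0] if len(normalized) == 1 else ABSTAIN
```

Because the gate only compares strings, it adds no model calls beyond the three regime passes themselves.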
Results & Findings
| Model (size) | Baseline F1 (no abstention) | F1 after disagreement‑abstention | AUROC (error detection) |
|---|---|---|---|
| LLaMA‑2 7B | 62.4 % | 71.8 % (≈ 9‑point gain) | 0.84 |
| LLaMA‑2 13B | 68.1 % | 76.3 % | 0.88 |
| GPT‑3 175B | 78.5 % | 84.2 % | 0.91 |
Key takeaways
- Accuracy gains from decomposition shrink as models get larger, confirming prior work that frontier models already internalize many reasoning steps.
- Disagreement is a strong error predictor: when any two regimes disagree, the answer is wrong > 80 % of the time, regardless of model size.
- Abstention improves overall quality: By refusing to answer on ambiguous cases, the system raises the precision of the answered set, which is valuable for safety‑critical applications.
- No extra cost beyond extra forward passes: The method is computationally cheap compared to retrieval‑augmented pipelines.
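The AUROC numbers above treat disagreement as an error-detection score. Since that signal is binary (1 = regimes disagree), its AUROC can be computed rank-based, with ties counted as one half. A minimal sketch with synthetic inputs, not the paper's data:

```python
def binary_auroc(scores, is_error):
    """AUROC of an error-detection score: the probability that a
    wrongly answered example scores higher than a correctly answered
    one, counting ties as 0.5. Works for a binary disagreement flag."""
    errs = [s for s, e in zip(scores, is_error) if e]
    oks = [s for s, e in zip(scores, is_error) if not e]
    wins = sum((e > o) + 0.5 * (e == o) for e in errs for o in oks)
    return wins / (len(errs) * len(oks))
```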
Practical Implications
- Safety‑first QA services – Companies can wrap any closed‑book LLM with a lightweight “confidence guardrail” that simply runs three prompts and drops answers when they don’t line up. This reduces the risk of delivering misinformation to end‑users.
- Cost‑effective reliability – Since no fine‑tuning or external knowledge bases are required, the technique can be deployed on existing APIs (e.g., OpenAI, Anthropic) with minimal engineering effort.
- Debugging tool for developers – The disagreement pattern can highlight topics where the model’s knowledge is shaky, guiding data collection or prompting strategies.
- Composable pipelines – The approach can be combined with retrieval‑augmented generation (RAG): first try the disagreement‑abstention; if the model abstains, fall back to a retrieval step. This yields a hybrid system that only pays for expensive retrieval when necessary.
- Regulatory compliance – In domains like healthcare or finance, being able to say “I don’t know” is often a legal requirement; this method offers a straightforward way to meet that demand.
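The hybrid RAG composition mentioned above is a few lines of glue code; `cheap_answer` and `rag_answer` are hypothetical callables standing in for the abstention pipeline and a retrieval-augmented system:

```python
ABSTAIN = "I don't know"

def hybrid_answer(question, cheap_answer, rag_answer):
    """Try the disagreement-gated closed-book pipeline first; fall back
    to (more expensive) retrieval-augmented generation only when the
    cheap pipeline abstains. Returns (answer, route) for logging."""
    ans = cheap_answer(question)
    if ans != ABSTAIN:
        return ans, "closed_book"
    return rag_answer(question), "rag_fallback"
```

This way the retrieval cost is paid only on the abstained fraction of traffic, and the route tag makes the fallback rate easy to monitor.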
Limitations & Future Work
- Increased latency – Running three forward passes triples inference time; for real‑time applications, batching or model distillation may be needed.
- Prompt sensitivity – The effectiveness hinges on the design of the three prompts; poorly chosen prompts could produce spurious disagreements.
- Binary abstention – The current policy is a hard “yes/no” decision. Future work could explore graded confidence scores or partial answer generation.
- Scope limited to multi‑hop QA – While the authors test on several benchmarks, it remains unclear how well the technique transfers to other tasks (e.g., code generation, summarization).
- Scaling to larger ensembles – Investigating whether adding more diverse prompts or models further improves reliability without prohibitive cost is an open question.
Authors
- Dhruv Madhwal
- Lyuxin David Zhang
- Dan Roth
- Tomer Wolfson
- Vivek Gupta
Paper Information
- arXiv ID: 2602.04853v1
- Categories: cs.CL
- Published: February 4, 2026