[Paper] To Reason or Not to: Selective Chain-of-Thought in Medical Question Answering
Source: arXiv - 2602.20130v1
Overview
The paper introduces Selective Chain‑of‑Thought (Selective CoT), an inference‑time technique that lets a large language model decide on the fly whether a medical question actually needs a step‑by‑step reasoning trace. By generating a rationale only for the "hard" questions, the method cuts token usage and latency while keeping answer quality virtually unchanged, an attractive proposition for any developer looking to ship LLM‑powered clinical assistants at scale.
Key Contributions
- Dynamic reasoning decision: A lightweight classifier predicts if a question warrants a CoT explanation before the model starts generating one.
- Model‑agnostic plug‑in: Works with off‑the‑shelf open‑source LLMs (Llama‑3.1‑8B, Qwen‑2.5‑7B) without any fine‑tuning of the base model.
- Efficiency gains: Reduces inference time by 13 %–45 % and token consumption by 8 %–47 % across four biomedical QA benchmarks.
- Minimal accuracy trade‑off: Accuracy drops by at most 4 %, and in several settings it actually improves over standard CoT.
- Comparison with fixed‑length CoT: Shows that a dynamic “reason‑when‑needed” policy matches or exceeds the performance of a naïve fixed‑step reasoning baseline while using far fewer resources.
Methodology
- Two‑stage inference pipeline
- Stage 1 – Reasoning‑need classifier: A small prompt (or a tiny fine‑tuned head) asks the LLM to output a binary signal (“reason” vs. “no‑reason”) based on the question alone.
- Stage 2 – Answer generation:
- If the classifier says reason, the model runs a classic Chain‑of‑Thought prompt that forces it to produce a step‑by‑step rationale before the final answer.
- If the classifier says no‑reason, the model skips the rationale and directly emits the answer (a “direct answer” prompt).
- Benchmarks & models
- Four public biomedical QA benchmarks: HeadQA, MedQA‑USMLE, MedMCQA, and PubMedQA.
- Two open‑source LLMs representing different architectures: Llama‑3.1‑8B (Meta) and Qwen‑2.5‑7B (Alibaba).
- Metrics
- Accuracy (exact‑match or multiple‑choice correctness).
- Total generated tokens (proxy for compute cost).
- Inference latency measured on identical hardware.
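The three metrics can be aggregated straightforwardly from per‑question logs. A minimal sketch (the record layout and helper names are illustrative, not taken from the paper):

```python
from statistics import mean

def aggregate_metrics(records):
    """Each record: (is_correct: bool, tokens: int, latency_s: float).
    Returns accuracy, mean generated tokens, and mean latency."""
    accuracy = mean(1.0 if correct else 0.0 for correct, _, _ in records)
    avg_tokens = mean(tokens for _, tokens, _ in records)
    avg_latency = mean(latency for _, _, latency in records)
    return accuracy, avg_tokens, avg_latency

def relative_savings(baseline, selective):
    """Fractional reduction relative to a baseline,
    e.g. 0.31 corresponds to the '31 %' token savings in the results table."""
    return 1.0 - selective / baseline
```

The same `relative_savings` form applies to both token counts and wall‑clock latency, which is how the percentage columns in the results table are typically derived.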
- Baselines
- Standard CoT (always generate a rationale).
- Fixed‑length CoT (pre‑defined number of reasoning steps).
The whole system is implemented with a single forward pass for the classifier and a conditional second pass for the answer, making it easy to drop into existing pipelines.
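The conditional two‑pass flow above can be sketched in a few lines. Everything here is an assumption for illustration: `llm_generate` is a hypothetical stand‑in for any text‑generation call (a Hugging Face model, an API client, etc.), and the prompt wording is not the paper's.

```python
# Hypothetical prompts for the two stages; wording is illustrative.
CLASSIFIER_PROMPT = (
    "Decide whether the following medical question needs step-by-step "
    "reasoning to answer correctly. Reply with exactly 'reason' or "
    "'no-reason'.\n\nQuestion: {question}"
)
COT_PROMPT = (
    "Answer the following medical question. Think step by step, then "
    "give the final answer on the last line.\n\nQuestion: {question}"
)
DIRECT_PROMPT = (
    "Answer the following medical question directly, with no "
    "explanation.\n\nQuestion: {question}"
)

def selective_cot(question, llm_generate):
    """Stage 1: one forward pass for the binary reasoning-need signal.
    Stage 2: a conditional second pass with a CoT or direct-answer prompt."""
    signal = llm_generate(CLASSIFIER_PROMPT.format(question=question))
    needs_reasoning = signal.strip().lower().startswith("reason")
    prompt = COT_PROMPT if needs_reasoning else DIRECT_PROMPT
    return llm_generate(prompt.format(question=question)), needs_reasoning
```

Because the routing logic is just a wrapper around two prompts, swapping in a fine‑tuned classifier head instead of the prompt‑based signal only changes Stage 1.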
Results & Findings
| Model / Dataset | Standard CoT Acc. | Selective CoT Acc. | Δ Accuracy | Token Savings | Latency Reduction |
|---|---|---|---|---|---|
| Llama‑3.1‑8B / HeadQA | 78.2 % | 77.9 % | –0.3 % | 31 % | 28 % |
| Qwen‑2.5‑7B / MedQA‑USMLE | 71.5 % | 71.8 % | +0.3 % | 45 % | 42 % |
| Llama‑3.1‑8B / MedMCQA | 66.0 % | 65.5 % | –0.5 % | 22 % | 19 % |
| Qwen‑2.5‑7B / PubMedQA | 78.9 % | 78.7 % | –0.2 % | 47 % | 44 % |
Key take‑aways
- Efficiency wins: Across the board, Selective CoT slashes token count and wall‑clock time, with the biggest gains on datasets where many questions are recall‑type (e.g., PubMedQA).
- Accuracy stays competitive: The worst‑case drop is under 4 %, and in two model‑task pairs the selective approach actually nudges accuracy upward—likely because noisy reasoning is avoided on simple questions.
- Interpretability retained: For the subset of questions that do trigger CoT, developers still get a human‑readable rationale, preserving the auditability that many medical AI regulations demand.
Practical Implications
- Cost‑effective deployment: Cloud providers charge per token or per GPU second. Cutting token usage by up to half can translate into significant OPEX savings, especially for high‑throughput clinical chatbots.
- Latency‑critical use cases: Faster responses are crucial in triage or decision‑support tools where clinicians cannot wait for a long LLM inference. Selective CoT brings sub‑second improvements without sacrificing safety.
- Dynamic workload balancing: In multi‑tenant SaaS platforms, the classifier can be used as a throttling knob—routing “easy” queries to a lightweight direct‑answer path and reserving full CoT resources for complex cases.
- Regulatory friendliness: By generating rationales only when needed, the system still provides traceability for high‑risk decisions, helping meet documentation requirements (e.g., FDA’s “explainability” guidance).
- Plug‑and‑play: Since the method works with any off‑the‑shelf LLM, teams can retrofit existing pipelines (Hugging Face Transformers, LangChain, etc.) with just a few lines of code.
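The "throttling knob" idea from the list above could be sketched as a load‑dependent threshold on the classifier's reasoning probability; the function, thresholds, and path names below are illustrative assumptions, not part of the paper:

```python
def route(reason_prob, load, base_threshold=0.5, max_threshold=0.9):
    """Send a query down the full-CoT path only if the classifier's
    reasoning probability clears a load-dependent threshold.

    `load` in [0, 1]: as the platform saturates, the threshold rises,
    so fewer queries receive the expensive reasoning path."""
    threshold = base_threshold + (max_threshold - base_threshold) * load
    return "cot" if reason_prob >= threshold else "direct"
```

Under this scheme, borderline questions get full reasoning when capacity is free but fall back to direct answers under peak load, while clearly hard questions are always routed to CoT.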
Limitations & Future Work
- Binary decision granularity: The current classifier makes a coarse “reason / no‑reason” call. Some questions might benefit from a short rationale rather than a full CoT, suggesting a multi‑level reasoning depth controller.
- Domain shift risk: The classifier is trained on the same benchmarks it is evaluated on; performance on out‑of‑distribution clinical queries (e.g., rare disease case reports) remains untested.
- Explainability trade‑off: For the “no‑reason” path, there is no explicit rationale, which could be a compliance hurdle for certain regulated scenarios.
- Scalability to larger models: Experiments were limited to 7‑8 B‑parameter models. It remains to be seen whether the same relative savings hold for 70 B‑scale LLMs where the cost of a single forward pass dominates.
- Future directions include: (1) training a confidence‑aware selector that can output a “partial CoT” length, (2) evaluating on real‑world clinical conversation logs, and (3) integrating reinforcement learning to let the selector optimize a joint utility of accuracy vs. latency.
Authors
- Zaifu Zhan
- Min Zeng
- Shuang Zhou
- Yiran Song
- Xiaoyi Chen
- Yu Hou
- Yifan Wu
- Yang Ruan
- Rui Zhang
Paper Information
- arXiv ID: 2602.20130v1
- Categories: cs.CL, cs.AI
- Published: February 23, 2026