[Paper] To Reason or Not to: Selective Chain-of-Thought in Medical Question Answering
Source: arXiv - 2602.20130v1
Overview
The paper introduces Selective Chain‑of‑Thought (Selective CoT), an inference‑time technique that lets a large language model decide on the fly whether a medical question actually needs a step‑by‑step reasoning trace. By generating a rationale only for the "hard" questions, the method cuts token usage and latency while keeping answer quality virtually unchanged, an attractive proposition for any developer looking to ship LLM‑powered clinical assistants at scale.
Key Contributions
- Dynamic reasoning decision: A lightweight classifier predicts if a question warrants a CoT explanation before the model starts generating one.
- Model‑agnostic plug‑in: Works with off‑the‑shelf open‑source LLMs (Llama‑3.1‑8B, Qwen‑2.5‑7B) without any fine‑tuning of the base model.
- Efficiency gains: Reduces inference time by 13 %–45 % and token consumption by 8 %–47 % across four biomedical QA benchmarks.
- Minimal accuracy trade‑off: Accuracy drops by at most 4 %, and in several settings it actually improves over standard CoT.
- Comparison with fixed‑length CoT: Shows that a dynamic “reason‑when‑needed” policy matches or exceeds the performance of a naïve fixed‑step reasoning baseline while using far fewer resources.
Methodology
- Two‑stage inference pipeline
- Stage 1 – Reasoning‑need classifier: A small prompt (or a tiny fine‑tuned head) asks the LLM to output a binary signal (“reason” vs. “no‑reason”) based on the question alone.
- Stage 2 – Answer generation:
- If the classifier says reason, the model runs a classic Chain‑of‑Thought prompt that forces it to produce a step‑by‑step rationale before the final answer.
- If the classifier says no‑reason, the model skips the rationale and directly emits the answer (a “direct answer” prompt).
- Benchmarks & models
- Four public biomedical QA benchmarks: HeadQA, MedQA‑USMLE, MedMCQA, and PubMedQA.
- Two open‑source LLMs representing different architectures: Llama‑3.1‑8B (Meta) and Qwen‑2.5‑7B (Alibaba).
- Metrics
- Accuracy (exact‑match or multiple‑choice correctness).
- Total generated tokens (proxy for compute cost).
- Inference latency measured on identical hardware.
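The three metrics can be aggregated straightforwardly from per‑question logs. A minimal sketch (the record layout and helper names are illustrative, not taken from the paper):

```python
from statistics import mean

def aggregate_metrics(records):
    """Each record: (is_correct: bool, tokens: int, latency_s: float).
    Returns accuracy, mean generated tokens, and mean latency."""
    accuracy = mean(1.0 if correct else 0.0 for correct, _, _ in records)
    avg_tokens = mean(tokens for _, tokens, _ in records)
    avg_latency = mean(latency for _, _, latency in records)
    return accuracy, avg_tokens, avg_latency

def relative_savings(baseline, selective):
    """Fractional reduction relative to a baseline,
    e.g. 0.31 corresponds to the '31 %' token savings in the results table."""
    return 1.0 - selective / baseline
```

The same `relative_savings` form applies to both token counts and wall‑clock latency, which is how the percentage columns in the results table are typically derived.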
- Baselines
- Standard CoT (always generate a rationale).
- Fixed‑length CoT (pre‑defined number of reasoning steps).
The whole system is implemented with a single forward pass for the classifier and a conditional second pass for the answer, making it easy to drop into existing pipelines.
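The conditional two‑pass flow above can be sketched in a few lines. Everything here is an assumption for illustration: `llm_generate` is a hypothetical stand‑in for any text‑generation call (a Hugging Face model, an API client, etc.), and the prompt wording is not the paper's.

```python
# Hypothetical prompts for the two stages; wording is illustrative.
CLASSIFIER_PROMPT = (
    "Decide whether the following medical question needs step-by-step "
    "reasoning to answer correctly. Reply with exactly 'reason' or "
    "'no-reason'.\n\nQuestion: {question}"
)
COT_PROMPT = (
    "Answer the following medical question. Think step by step, then "
    "give the final answer on the last line.\n\nQuestion: {question}"
)
DIRECT_PROMPT = (
    "Answer the following medical question directly, with no "
    "explanation.\n\nQuestion: {question}"
)

def selective_cot(question, llm_generate):
    """Stage 1: one forward pass for the binary reasoning-need signal.
    Stage 2: a conditional second pass with a CoT or direct-answer prompt."""
    signal = llm_generate(CLASSIFIER_PROMPT.format(question=question))
    needs_reasoning = signal.strip().lower().startswith("reason")
    prompt = COT_PROMPT if needs_reasoning else DIRECT_PROMPT
    return llm_generate(prompt.format(question=question)), needs_reasoning
```

Because the routing logic is just a wrapper around two prompts, swapping in a fine‑tuned classifier head instead of the prompt‑based signal only changes Stage 1.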
Results & Findings
| Model / Dataset | Standard CoT Acc. | Selective CoT Acc. | Δ Accuracy | Token Savings | Latency Reduction |
|---|---|---|---|---|---|
| Llama‑3.1‑8B / HeadQA | 78.2 % | 77.9 % | –0.3 % | 31 % | 28 % |
| Qwen‑2.5‑7B / MedQA‑USMLE | 71.5 % | 71.8 % | +0.3 % | 45 % | 42 % |
| Llama‑3.1‑8B / MedMCQA | 66.0 % | 65.5 % | –0.5 % | 22 % | 19 % |
| Qwen‑2.5‑7B / PubMedQA | 78.9 % | 78.7 % | –0.2 % | 47 % | 44 % |
Key take‑aways
- Efficiency wins: Across the board, Selective CoT slashes token count and wall‑clock time, with the biggest gains on datasets where many questions are recall‑type (e.g., PubMedQA).
- Accuracy stays competitive: The worst‑case drop is under 4 %, and in two model‑task pairs the selective approach actually nudges accuracy upward—likely because noisy reasoning is avoided on simple questions.
- Interpretability retained: For the subset of questions that do trigger CoT, developers still get a human‑readable rationale, preserving the auditability that many medical AI regulations demand.
Practical Implications
- Cost‑effective deployment: Cloud providers charge per token or per GPU second. Cutting token usage by up to half can translate into significant OPEX savings, especially for high‑throughput clinical chatbots.
- Latency‑critical use cases: Faster responses are crucial in triage or decision‑support tools where clinicians cannot wait for a long LLM inference. Selective CoT brings sub‑second improvements without sacrificing safety.
- Dynamic workload balancing: In multi‑tenant SaaS platforms, the classifier can be used as a throttling knob—routing “easy” queries to a lightweight direct‑answer path and reserving full CoT resources for complex cases.
- Regulatory friendliness: By generating rationales only when needed, the system still provides traceability for high‑risk decisions, helping meet documentation requirements (e.g., FDA’s “explainability” guidance).
- Plug‑and‑play: Since the method works with any off‑the‑shelf LLM, teams can retrofit existing pipelines (Hugging Face Transformers, LangChain, etc.) with just a few lines of code.
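The "throttling knob" idea from the list above could be sketched as a load‑dependent threshold on the classifier's reasoning probability; the function, thresholds, and path names below are illustrative assumptions, not part of the paper:

```python
def route(reason_prob, load, base_threshold=0.5, max_threshold=0.9):
    """Send a query down the full-CoT path only if the classifier's
    reasoning probability clears a load-dependent threshold.

    `load` in [0, 1]: as the platform saturates, the threshold rises,
    so fewer queries receive the expensive reasoning path."""
    threshold = base_threshold + (max_threshold - base_threshold) * load
    return "cot" if reason_prob >= threshold else "direct"
```

Under this scheme, borderline questions get full reasoning when capacity is free but fall back to direct answers under peak load, while clearly hard questions are always routed to CoT.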
Limitations & Future Work
- Binary decision granularity: The current classifier makes a coarse “reason / no‑reason” call. Some questions might benefit from a short rationale rather than a full CoT, suggesting a multi‑level reasoning depth controller.
- Domain shift risk: The classifier is trained on the same benchmarks it is evaluated on; performance on out‑of‑distribution clinical queries (e.g., rare disease case reports) remains untested.
- Explainability trade‑off: For the “no‑reason” path, there is no explicit rationale, which could be a compliance hurdle for certain regulated scenarios.
- Scalability to larger models: Experiments were limited to 7‑8 B‑parameter models. It remains to be seen whether the same relative savings hold for 70 B‑scale LLMs where the cost of a single forward pass dominates.
- Future directions include: (1) training a confidence‑aware selector that can output a “partial CoT” length, (2) evaluating on real‑world clinical conversation logs, and (3) integrating reinforcement learning to let the selector optimize a joint utility of accuracy vs. latency.
Authors
- Zaifu Zhan
- Min Zeng
- Shuang Zhou
- Yiran Song
- Xiaoyi Chen
- Yu Hou
- Yifan Wu
- Yang Ruan
- Rui Zhang
Paper Information
- arXiv ID: 2602.20130v1
- Categories: cs.CL, cs.AI
- Published: February 23, 2026