[Paper] Optimizing Medical Question-Answering Systems: A Comparative Study of Fine-Tuned and Zero-Shot Large Language Models with RAG Framework
Source: arXiv - 2512.05863v1
Overview
This paper investigates how to make medical question‑answering (QA) systems both accurate and trustworthy by pairing open‑source large language models (LLMs) with a Retrieval‑Augmented Generation (RAG) pipeline. By fine‑tuning LLaMA 2 and Falcon with Low‑Rank Adaptation (LoRA) and grounding their responses in retrieved PubMed literature, the authors report a 16‑point gain in factual accuracy over zero‑shot use of the same models.
Key Contributions
- RAG‑based architecture that couples domain‑specific document retrieval with open‑source LLMs for biomedical QA.
- Efficient fine‑tuning of LLaMA 2 and Falcon via LoRA, enabling rapid domain adaptation without full model retraining.
- Empirical benchmark on PubMedQA and MedMCQA showing a 16‑point accuracy lift (71.8 % vs. 55.4 % zero‑shot) and a ~60 % reduction in hallucinated content.
- Transparency layer that automatically attaches source citations to each generated answer, improving auditability for clinicians.
- Open‑source reproducibility package (code, LoRA weights, and retrieval index) released for the community.
Methodology
- Document Corpus Construction – The authors built a searchable index of ~2 M PubMed abstracts and full‑text articles using dense embeddings (Sentence‑Transformers) and a vector database (FAISS).
- Retrieval Step – For any user query, the top‑k (k = 5) most relevant passages are fetched based on cosine similarity.
- Prompt Engineering – Retrieved passages are concatenated with a system prompt that instructs the LLM to cite sources and to answer concisely.
- Model Fine‑Tuning – LoRA adapters (rank = 8) are trained on a curated set of 10 k medical QA pairs (derived from PubMedQA, MedMCQA, and manually verified examples). This adds only ~0.1 % extra parameters, keeping compute costs low.
- Generation & Post‑Processing – The LLM generates an answer; a lightweight verifier checks that each claim is linked to at least one retrieved passage, flagging unsupported statements.
The pipeline is modular, allowing any compatible LLM to be swapped in without rebuilding the retrieval index. Minimal illustrative sketches of the retrieval, fine‑tuning, and verification steps follow.
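To make steps 1–3 concrete, here is a minimal sketch of indexing, top‑k retrieval, and prompt assembly. The paper specifies Sentence‑Transformers, FAISS, cosine similarity, and k = 5; the `all-MiniLM-L6-v2` checkpoint, the toy two‑document corpus, and the exact prompt wording below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of steps 1-3: dense indexing, top-k retrieval, and
# citation-aware prompt assembly. The all-MiniLM-L6-v2 checkpoint, the
# toy corpus, and the prompt wording are illustrative assumptions.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

abstracts = [  # stand-in for the paper's ~2 M-document PubMed index
    "[PMID 111] Metformin is first-line pharmacotherapy for type 2 diabetes.",
    "[PMID 222] Statins lower LDL cholesterol and reduce cardiovascular risk.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

# Normalized embeddings in an inner-product index give cosine similarity.
emb = encoder.encode(abstracts, normalize_embeddings=True)
index = faiss.IndexFlatIP(int(emb.shape[1]))
index.add(np.asarray(emb, dtype="float32"))

def retrieve(query: str, k: int = 5) -> list[str]:
    """Fetch the top-k passages by cosine similarity (k = 5 in the paper)."""
    q = encoder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [abstracts[i] for i in ids[0] if i >= 0]

def build_prompt(query: str) -> str:
    """Concatenate retrieved passages with a system prompt asking for a
    concise, citation-backed answer (step 3)."""
    context = "\n".join(retrieve(query))
    return (
        "You are a medical QA assistant. Answer concisely and cite the "
        "PMID of each passage you rely on.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

Any swapped‑in LLM would then be called on the output of `build_prompt(query)`, which is what keeps the pipeline modular with respect to the generator.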
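Step 4 maps naturally onto Hugging Face's `peft` library. Only the rank (8) and the ~0.1 % trainable‑parameter figure come from the paper; the base checkpoint, alpha, dropout, and target modules below are assumptions for illustration.

```python
# Sketch of step 4: attaching rank-8 LoRA adapters with Hugging Face peft.
# Rank 8 and the ~0.1 % trainable-parameter figure come from the paper;
# the base checkpoint, alpha, dropout, and target modules are assumed.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_cfg = LoraConfig(
    r=8,                                  # rank = 8, as stated in the paper
    lora_alpha=16,                        # assumed scaling factor
    lora_dropout=0.05,                    # assumed regularization
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # prints the small trainable fraction
```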
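The paper describes the step‑5 verifier only as a lightweight check that links each claim to a retrieved passage. One plausible minimal realization scores each answer sentence against the retrieved passages by embedding similarity; the naive sentence splitting and the 0.6 threshold are purely illustrative.

```python
# Sketch of step 5: flag answer sentences that no retrieved passage
# supports. The naive sentence split and the 0.6 threshold are
# illustrative assumptions; the paper does not detail the verifier.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def flag_unsupported(answer: str, passages: list[str],
                     threshold: float = 0.6) -> list[str]:
    """Return answer sentences whose best similarity to any retrieved
    passage falls below the threshold."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    sent_emb = encoder.encode(sentences, normalize_embeddings=True)
    pass_emb = encoder.encode(passages, normalize_embeddings=True)
    sims = util.cos_sim(sent_emb, pass_emb)  # (sentences x passages) matrix
    return [s for s, row in zip(sentences, sims) if float(row.max()) < threshold]
```

Sentences returned by `flag_unsupported` would be surfaced as unverified, matching the flagging behavior the paper describes.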
Results & Findings
| Model (setup) | PubMedQA Accuracy | MedMCQA Accuracy | Hallucination Reduction |
|---|---|---|---|
| Zero‑shot LLaMA 2 (no RAG) | 55.4 % | 48.1 % | — |
| Zero‑shot LLaMA 2 + RAG | 63.2 % | 55.7 % | ~35 % |
| LoRA‑fine‑tuned LLaMA 2 + RAG | 71.8 % | 63.4 % | ~60 % |
| LoRA‑fine‑tuned Falcon + RAG | 68.5 % | 60.9 % | ~55 % |
- Adding retrieval alone already boosts performance by 7–8 percentage points.
- Fine‑tuning with LoRA yields a further ~8 percentage points, surpassing many proprietary baselines.
- The citation‑aware verifier cuts unsupported content from roughly 30 % of generated tokens to under 12 %.
Practical Implications
- Developer‑ready toolkit – The modular RAG stack (FAISS + Sentence‑Transformers + LoRA‑enabled LLM) can be dropped into existing health‑tech platforms (e.g., tele‑triage bots, EHR decision support).
- Cost‑effective specialization – LoRA fine‑tuning runs on a single 24 GB GPU in under 4 hours, making domain adaptation feasible for startups without massive compute budgets.
- Regulatory friendliness – Automatic source attribution satisfies emerging transparency requirements for AI in healthcare, easing audit trails for FDA or EMA submissions.
- Scalable to other domains – The same pattern (retrieval + lightweight adapter) can be reused for legal, financial, or scientific QA, reducing the need for massive domain‑specific corpora.
Limitations & Future Work
- Corpus freshness – The retrieval index is static; emerging medical literature (e.g., COVID‑19 studies) would require periodic re‑indexing.
- Answer depth – While factual accuracy improves, the system still struggles with multi‑step reasoning or nuanced clinical judgment.
- Evaluation scope – Benchmarks focus on multiple‑choice QA; real‑world conversational settings (follow‑up questions, ambiguous phrasing) remain untested.
- Future directions suggested by the authors include: integrating a live PubMed API for on‑the‑fly updates, exploring chain‑of‑thought prompting to boost reasoning, and extending the verifier to flag potential bias in retrieved sources.
Authors
- Tasnimul Hassan
- Md Faisal Karim
- Haziq Jeelani
- Elham Behnam
- Robert Green
- Fayeq Jeelani Syed
Paper Information
- arXiv ID: 2512.05863v1
- Categories: cs.CL, cs.AI
- Published: December 5, 2025
- PDF: https://arxiv.org/pdf/2512.05863v1