[Paper] Optimizing Medical Question-Answering Systems: A Comparative Study of Fine-Tuned and Zero-Shot Large Language Models with RAG Framework
Source: arXiv - 2512.05863v1
Overview
This paper investigates how to make medical question‑answering (QA) systems both accurate and trustworthy by pairing open‑source large language models (LLMs) with a Retrieval‑Augmented Generation (RAG) pipeline. By fine‑tuning LLaMA 2 and Falcon with Low‑Rank Adaptation (LoRA) and grounding their responses in retrieved PubMed literature, the authors report a 16‑point gain in factual accuracy over zero‑shot use of the same models.
Key Contributions
- RAG‑based architecture that couples domain‑specific document retrieval with open‑source LLMs for biomedical QA.
- Efficient fine‑tuning of LLaMA 2 and Falcon via LoRA, enabling rapid domain adaptation without full model retraining.
- Empirical benchmark on PubMedQA and MedMCQA showing a 16‑point accuracy lift (71.8 % vs. 55.4 % zero‑shot) and a ~60 % reduction in hallucinated content.
- Transparency layer that automatically attaches source citations to each generated answer, improving auditability for clinicians.
- Open‑source reproducibility package (code, LoRA weights, and retrieval index) released for the community.
Methodology
- Document Corpus Construction – The authors built a searchable index of ~2 M PubMed abstracts and full‑text articles using dense embeddings (Sentence‑Transformers) and a vector database (FAISS).
- Retrieval Step – For any user query, the top‑k (k = 5) most relevant passages are fetched based on cosine similarity.
- Prompt Engineering – Retrieved passages are concatenated with a system prompt that instructs the LLM to cite sources and to answer concisely.
- Model Fine‑Tuning – LoRA adapters (rank = 8) are trained on a curated set of 10 k medical QA pairs (derived from PubMedQA, MedMCQA, and manually verified examples). This adds only ~0.1 % extra parameters, keeping compute costs low.
- Generation & Post‑Processing – The LLM generates an answer; a lightweight verifier checks that each claim is linked to at least one retrieved passage, flagging unsupported statements.
The pipeline is modular, allowing any compatible LLM to be swapped in without rebuilding the retrieval index. Minimal illustrative sketches of the retrieval, fine‑tuning, and verification steps follow.
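To make steps 1–3 concrete, here is a minimal sketch of indexing, top‑k retrieval, and prompt assembly. The paper specifies Sentence‑Transformers, FAISS, cosine similarity, and k = 5; the `all-MiniLM-L6-v2` checkpoint, the toy two‑document corpus, and the exact prompt wording below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of steps 1-3: dense indexing, top-k retrieval, and
# citation-aware prompt assembly. The all-MiniLM-L6-v2 checkpoint, the
# toy corpus, and the prompt wording are illustrative assumptions.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

abstracts = [  # stand-in for the paper's ~2 M-document PubMed index
    "[PMID 111] Metformin is first-line pharmacotherapy for type 2 diabetes.",
    "[PMID 222] Statins lower LDL cholesterol and reduce cardiovascular risk.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

# Normalized embeddings in an inner-product index give cosine similarity.
emb = encoder.encode(abstracts, normalize_embeddings=True)
index = faiss.IndexFlatIP(int(emb.shape[1]))
index.add(np.asarray(emb, dtype="float32"))

def retrieve(query: str, k: int = 5) -> list[str]:
    """Fetch the top-k passages by cosine similarity (k = 5 in the paper)."""
    q = encoder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [abstracts[i] for i in ids[0] if i >= 0]

def build_prompt(query: str) -> str:
    """Concatenate retrieved passages with a system prompt asking for a
    concise, citation-backed answer (step 3)."""
    context = "\n".join(retrieve(query))
    return (
        "You are a medical QA assistant. Answer concisely and cite the "
        "PMID of each passage you rely on.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

Any swapped‑in LLM would then be called on the output of `build_prompt(query)`, which is what keeps the pipeline modular with respect to the generator.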
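Step 4 maps naturally onto Hugging Face's `peft` library. Only the rank (8) and the ~0.1 % trainable‑parameter figure come from the paper; the base checkpoint, alpha, dropout, and target modules below are assumptions for illustration.

```python
# Sketch of step 4: attaching rank-8 LoRA adapters with Hugging Face peft.
# Rank 8 and the ~0.1 % trainable-parameter figure come from the paper;
# the base checkpoint, alpha, dropout, and target modules are assumed.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_cfg = LoraConfig(
    r=8,                                  # rank = 8, as stated in the paper
    lora_alpha=16,                        # assumed scaling factor
    lora_dropout=0.05,                    # assumed regularization
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # prints the small trainable fraction
```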
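The paper describes the step‑5 verifier only as a lightweight check that links each claim to a retrieved passage. One plausible minimal realization scores each answer sentence against the retrieved passages by embedding similarity; the naive sentence splitting and the 0.6 threshold are purely illustrative.

```python
# Sketch of step 5: flag answer sentences that no retrieved passage
# supports. The naive sentence split and the 0.6 threshold are
# illustrative assumptions; the paper does not detail the verifier.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def flag_unsupported(answer: str, passages: list[str],
                     threshold: float = 0.6) -> list[str]:
    """Return answer sentences whose best similarity to any retrieved
    passage falls below the threshold."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    sent_emb = encoder.encode(sentences, normalize_embeddings=True)
    pass_emb = encoder.encode(passages, normalize_embeddings=True)
    sims = util.cos_sim(sent_emb, pass_emb)  # (sentences x passages) matrix
    return [s for s, row in zip(sentences, sims) if float(row.max()) < threshold]
```

Sentences returned by `flag_unsupported` would be surfaced as unverified, matching the flagging behavior the paper describes.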
Results & Findings
| Model (setup) | PubMedQA Accuracy | MedMCQA Accuracy | Hallucination Reduction |
|---|---|---|---|
| Zero‑shot LLaMA 2 (no RAG) | 55.4 % | 48.1 % | — |
| Zero‑shot LLaMA 2 + RAG | 63.2 % | 55.7 % | ~35 % |
| LoRA‑fine‑tuned LLaMA 2 + RAG | 71.8 % | 63.4 % | ~60 % |
| LoRA‑fine‑tuned Falcon + RAG | 68.5 % | 60.9 % | ~55 % |
- Adding retrieval alone already boosts performance by 7–8 percentage points.
- Fine‑tuning with LoRA yields a further ~8 percentage points, surpassing many proprietary baselines.
- The citation‑aware verifier cuts unsupported content from roughly 30 % of generated tokens to under 12 %.
Practical Implications
- Developer‑ready toolkit – The modular RAG stack (FAISS + Sentence‑Transformers + LoRA‑enabled LLM) can be dropped into existing health‑tech platforms (e.g., tele‑triage bots, EHR decision support).
- Cost‑effective specialization – LoRA fine‑tuning runs on a single 24 GB GPU in under 4 hours, making domain adaptation feasible for startups without massive compute budgets.
- Regulatory friendliness – Automatic source attribution satisfies emerging transparency requirements for AI in healthcare, easing audit trails for FDA or EMA submissions.
- Scalable to other domains – The same pattern (retrieval + lightweight adapter) can be reused for legal, financial, or scientific QA, reducing the need for massive domain‑specific corpora.
Limitations & Future Work
- Corpus freshness – The retrieval index is static; emerging medical literature (e.g., COVID‑19 studies) would require periodic re‑indexing.
- Answer depth – While factual accuracy improves, the system still struggles with multi‑step reasoning or nuanced clinical judgment.
- Evaluation scope – Benchmarks focus on multiple‑choice QA; real‑world conversational settings (follow‑up questions, ambiguous phrasing) remain untested.
- Future directions suggested by the authors include: integrating a live PubMed API for on‑the‑fly updates, exploring chain‑of‑thought prompting to boost reasoning, and extending the verifier to flag potential bias in retrieved sources.
Authors
- Tasnimul Hassan
- Md Faisal Karim
- Haziq Jeelani
- Elham Behnam
- Robert Green
- Fayeq Jeelani Syed
Paper Information
- arXiv ID: 2512.05863v1
- Categories: cs.CL, cs.AI
- Published: December 5, 2025
- PDF: https://arxiv.org/pdf/2512.05863v1