[Paper] A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine
Source: arXiv - 2601.22124v1
Overview
A new study proposes Fed‑MedLoRA, a federated‑learning framework that lets multiple hospitals fine‑tune massive language models for medical tasks without sharing raw patient data or the full model weights. By sending only tiny low‑rank adapters, the approach slashes communication costs and tackles the notorious data‑heterogeneity problem that plagues traditional federated learning in healthcare.
Key Contributions
- Parameter‑efficient federated learning: Introduces Fed‑MedLoRA, which transmits only LoRA (Low‑Rank Adaptation) adapters instead of the entire multi‑billion‑parameter LLM.
- Heterogeneity‑aware aggregation: Extends the base method to Fed‑MedLoRA+, adding an adaptive, data‑aware weighting scheme that improves convergence when sites have vastly different patient populations and documentation styles.
- Real‑world medical IE benchmark: Applies the framework to clinical information extraction (IE) across five diverse patient cohorts, comparing against strong baselines (BERT, LLaMA‑3, DeepSeek‑R1, GPT‑4o).
- Comprehensive evaluation: Tests in‑domain performance, external validation on unseen institutions, and a low‑resource “new‑site” adaptation scenario using real notes from Yale New Haven Health.
- Open‑source implementation: Provides code and adapter checkpoints to accelerate reproducibility and downstream adoption.
Methodology
- Base Model Selection – Starts from a pre‑trained LLM (e.g., LLaMA‑3) that already exhibits strong medical reasoning.
- LoRA Adapter Insertion – Inserts low‑rank trainable matrices into each transformer layer; the original weights stay frozen. This shrinks what must be trained and transmitted from billions of parameters to adapters totaling only a few megabytes per site.
- Federated Training Loop
- Each participating hospital downloads the current global adapter set.
- Local data (clinical notes) are used to fine‑tune only the adapters for a few epochs.
- Only the updated adapter deltas are uploaded back to the central server.
- Adaptive Aggregation (Fed‑MedLoRA+) – The server computes site‑specific weights based on validation loss, data size, and a measure of distribution shift, then aggregates adapters accordingly.
- Evaluation Pipeline – After each round, the global adapter is evaluated on a held‑out IE test set (entity and relation extraction) for each cohort, enabling early stopping and performance tracking.
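The adapter-insertion step above can be sketched as a linear layer with a frozen weight matrix plus a trainable low‑rank delta. This is a minimal NumPy illustration of the general LoRA idea, not the paper's implementation; the class name, rank, and `alpha` scaling are illustrative assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen dense layer W plus a trainable low-rank delta B @ A (rank r)."""

    def __init__(self, w_frozen: np.ndarray, rank: int, alpha: float = 16.0):
        d_out, d_in = w_frozen.shape
        self.w = w_frozen                             # frozen pre-trained weights
        self.a = np.random.randn(rank, d_in) * 0.01   # trainable, small random init
        self.b = np.zeros((d_out, rank))              # trainable, zero init: delta starts at 0
        self.scale = alpha / rank

    def forward(self, x: np.ndarray) -> np.ndarray:
        # y = x W^T + scale * x (B A)^T ; in training, only A and B receive gradients
        return x @ self.w.T + self.scale * (x @ self.a.T @ self.b.T)

    def adapter_params(self) -> int:
        # Only these parameters are uploaded to the server each round
        return self.a.size + self.b.size
```

Because `B` is zero-initialized, the layer reproduces the frozen model exactly before any local fine‑tuning, and only the small `A`/`B` factors ever leave the hospital.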
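The adaptive aggregation step can likewise be sketched server-side. The paper's exact Fed‑MedLoRA+ weighting is not reproduced here; this sketch assumes a simple proxy in which each site's weight is proportional to its data size divided by its validation loss, normalized to a convex combination.

```python
import numpy as np

def aggregate_adapters(adapters, n_samples, val_losses, eps=1e-8):
    """Weighted average of per-site LoRA adapters (Fed-MedLoRA+-style sketch).

    adapters   : list of dicts {"A": ndarray, "B": ndarray}, one per site
    n_samples  : local training-set sizes (larger -> more weight)
    val_losses : local validation losses (lower -> more weight)
    """
    raw = np.array(n_samples, dtype=float) / (np.array(val_losses, dtype=float) + eps)
    weights = raw / raw.sum()                 # normalize so weights sum to 1
    aggregated = {
        key: sum(w * site[key] for w, site in zip(weights, adapters))
        for key in adapters[0]
    }
    return aggregated, weights
```

With equal data sizes and equal losses this reduces to plain FedAvg over the adapter matrices; skewed losses or cohort sizes tilt the global adapter toward the better-fitting or larger sites.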
Results & Findings
| Setting | Model | F1 (Entity) | F1 (Relation) | Communication (GB) |
|---|---|---|---|---|
| In‑domain (5 sites) | Fed‑MedLoRA | 84.2 | 78.5 | 0.12 |
| Heterogeneous (5 sites) | Fed‑MedLoRA+ | 86.1 | 80.3 | 0.13 |
| In‑domain baseline | BERT‑based IE | 71.4 | 64.0 | 0.45 |
| Centralized | LLaMA‑3 | 83.5 | 77.9 | 2.3 |
| Zero‑shot | GPT‑4o | 78.0 | 71.2 | n/a |
- Communication savings: Transmitting only the adapters cut per‑round bandwidth by more than 95% compared with sending the full LLM weights.
- Heterogeneity handling: Fed‑MedLoRA+ consistently outperformed the vanilla version on cohorts with divergent note styles (e.g., pediatric vs. oncology).
- Low‑resource adaptation: When a brand‑new site with only 200 notes joined, the federated adapters boosted its IE F1 from 62% (local fine‑tuning alone) to 78% after just two communication rounds.
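The bandwidth saving follows directly from the parameter counts. This back-of-the-envelope sketch uses illustrative dimensions (an 8B‑parameter model in fp16, rank‑8 adapters on four projection matrices per layer), which are assumptions rather than the paper's reported configuration:

```python
def lora_payload_bytes(n_layers: int, d_model: int, rank: int,
                       n_target_mats: int = 4, bytes_per_param: int = 2) -> int:
    """Bytes per round when only LoRA factors A (r x d) and B (d x r) are sent."""
    per_matrix = 2 * rank * d_model           # A and B together
    return n_layers * n_target_mats * per_matrix * bytes_per_param

full_model = 8_000_000_000 * 2                # fp16 full-model upload, ~16 GB
lora = lora_payload_bytes(n_layers=32, d_model=4096, rank=8)   # ~17 MB
savings = 1 - lora / full_model               # well above 0.95
```

Even with generous adapter settings, the per-round payload stays in the tens of megabytes, which is consistent with the >95% reduction the paper reports.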
Practical Implications
- Scalable multi‑institution collaborations – Hospitals can jointly improve a shared medical LLM without exposing PHI or needing petabyte‑scale network links.
- Rapid deployment in new clinics – A small batch of local notes is enough to plug in the global adapter, dramatically shortening time‑to‑value for AI‑assisted chart review or coding assistance.
- Cost‑effective model updates – Because only adapters are exchanged, existing on‑premise LLM deployments (e.g., via NVIDIA DGX or cloud‑based inference APIs) can stay static while still benefiting from the latest federated knowledge.
- Regulatory friendliness – The approach aligns with data‑locality requirements (e.g., HIPAA, GDPR) since raw text never leaves the institution.
Limitations & Future Work
- Adapter expressiveness – While LoRA adapters are lightweight, they may not capture all nuances needed for highly specialized tasks (e.g., rare disease phenotyping).
- Security of updates – The paper acknowledges potential model‑inversion attacks on uploaded adapters; future work should explore differential privacy or secure aggregation.
- Broader task coverage – Experiments focus on information extraction; extending to generative clinical tasks (summarization, decision support) remains an open question.
- Scalability to dozens of sites – The current study involves five institutions; testing the framework at national or international scales will be needed to validate robustness under extreme heterogeneity.
Authors
- Anran Li
- Yuanyuan Chen
- Wenjun Long
- Yu Yin
- Yan Hu
- Hyunjae Kim
- Weipeng Zhou
- Yujia Zhou
- Hongyi Peng
- Yang Ren
- Xuguang Ai
- Zhenyue Qin
- Ming Hu
- Xiaoxiao Li
- Han Yu
- Yih‑Chung Tham
- Lucila Ohno‑Machado
- Hua Xu
- Qingyu Chen
Paper Information
- arXiv ID: 2601.22124v1
- Categories: cs.CL, cs.DC
- Published: January 29, 2026
- PDF: Download PDF