[Paper] PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data
Source: arXiv - 2602.21165v1
Overview
The paper presents PVminer, a domain‑specific natural‑language‑processing (NLP) toolkit that automatically extracts the “patient voice” (PV) from large collections of patient‑generated text such as secure messages, surveys, and interview transcripts. By turning unstructured patient communication into structured, machine‑readable labels, PVminer makes it feasible for health systems to scale qualitative insights that were previously limited to labor‑intensive manual coding.
Key Contributions
- Domain‑adapted BERT models (PV‑BERT‑base & PV‑BERT‑large) fine‑tuned on patient‑authored language, outperforming generic biomedical and clinical BERT variants.
- Multi‑label, hierarchical classification that predicts three label levels (Code, Subcode, Combo) in a single pipeline.
- Topic‑augmented representation (PV‑Topic‑BERT) that injects unsupervised topic vectors into the encoder, enriching semantic context.
- Comprehensive benchmark showing F1 scores of 82.25 % (Code), 80.14 % (Subcode), and 77.87 % (Combo) against strong baselines.
- Open‑source release of models, training scripts, and documentation, plus a request‑based annotated dataset for research reuse.
Methodology
- Data Curation – Secure patient‑provider messages were manually annotated with a hierarchical coding scheme that captures both patient‑centered communication (PCC) categories and social determinants of health (SDoH).
- Domain Adaptation – Two BERT models were further pre‑trained on the patient‑generated corpus, creating PV‑BERT‑base (12 layers) and PV‑BERT‑large (24 layers). This step teaches the model the idiosyncrasies of patient language (e.g., colloquialisms, misspellings, shorthand).
- Topic Modeling – An unsupervised LDA‑style model extracts latent topics from the same corpus. The resulting topic distribution vectors are concatenated with the BERT token embeddings, forming the PV‑Topic‑BERT input.
- Multi‑Task Fine‑Tuning – A shared encoder feeds three classification heads (Code, Subcode, Combo). The heads are trained jointly using a binary cross‑entropy loss for each label, allowing the model to learn inter‑label dependencies.
- Inference Augmentation – During prediction, the model also incorporates the author’s identity (patient vs. provider) as a binary feature, which the authors found improves discrimination between patient‑expressed concerns and provider‑generated content.
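The pipeline above can be sketched in plain NumPy. This is a minimal stand-in, not the paper's implementation: the dimensions, label-set sizes, and random "encoder" vector are hypothetical, with a dummy vector in place of a real BERT encoding and a Dirichlet draw in place of a real LDA topic distribution. It only illustrates the concatenation of encoder output, topic vector, and author-identity flag, followed by three sigmoid heads trained with per-label binary cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 768-d encoder output (BERT-base width), 50 topics,
# 1 binary author-identity feature, and illustrative label-set sizes.
ENC_DIM, N_TOPICS, N_CODE, N_SUBCODE, N_COMBO = 768, 50, 10, 30, 60

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(enc_vec, topic_vec, is_patient, heads):
    """Concatenate encoder output, topic distribution, and author flag,
    then score every label at each level with its own sigmoid head."""
    x = np.concatenate([enc_vec, topic_vec, [float(is_patient)]])
    return {name: sigmoid(x @ W) for name, W in heads.items()}

def bce(probs, targets):
    """Binary cross-entropy averaged over labels -- one term per label,
    mirroring the joint multi-label objective described above."""
    p = np.clip(probs, 1e-7, 1 - 1e-7)
    return float(-np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p)))

in_dim = ENC_DIM + N_TOPICS + 1
heads = {
    "code": rng.normal(0, 0.01, (in_dim, N_CODE)),
    "subcode": rng.normal(0, 0.01, (in_dim, N_SUBCODE)),
    "combo": rng.normal(0, 0.01, (in_dim, N_COMBO)),
}

enc = rng.normal(size=ENC_DIM)             # stand-in for a BERT [CLS] vector
topics = rng.dirichlet(np.ones(N_TOPICS))  # stand-in for an LDA topic mix
probs = forward(enc, topics, is_patient=True, heads=heads)
print({name: p.shape for name, p in probs.items()})
```

In training, the three heads' losses would be summed and backpropagated through the shared encoder, which is how the joint setup lets the levels inform one another.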
Results & Findings
| Task | PVminer F1 | BioBERT baseline F1 |
|---|---|---|
| Code (top‑level) | 82.25 % | 74.3 % |
| Subcode (mid‑level) | 80.14 % | 71.9 % |
| Combo (fine‑grained) | 77.87 % | 68.5 % |
- Ablation Study: Removing the author‑identity feature drops Code F1 by ~2 percentage points; removing topic augmentation drops Subcode F1 by ~3 percentage points, confirming both components add measurable value.
- Scalability: The end‑to‑end pipeline processes thousands of messages per hour on a single GPU, making it practical for health‑system‑wide deployment.
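The summary above reports one F1 per label level without specifying the averaging scheme; the sketch below assumes micro-averaging, the common choice for multi-label tasks, where true/false positives and negatives are pooled across all labels before combining precision and recall.

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for binary multi-label arrays of shape
    (n_samples, n_labels): pool TP/FP/FN over every label, then combine."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Toy check: 3 messages x 4 labels (one miss, one false alarm).
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 1],
                   [1, 1, 0, 1]])
print(round(micro_f1(y_true, y_pred), 4))  # -> 0.8333 (TP=5, FP=1, FN=1)
```

Macro-averaging (a per-label F1, then a plain mean) would weight rare Subcode and Combo labels more heavily, so the two schemes can diverge noticeably on skewed label sets like these.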
Practical Implications
- Automated SDoH Extraction – Clinicians and care managers can receive real‑time alerts about housing insecurity, transportation barriers, or medication affordability directly from patient messages, enabling proactive outreach.
- Quality‑Improvement Dashboards – Structured PV data can be visualized in population‑health dashboards, helping health systems track patient‑centered communication metrics across clinics.
- Clinical Decision Support – Integration with EHRs could surface patient‑voice tags alongside clinical notes, giving providers richer context for shared decision‑making.
- Research Acceleration – Researchers can query large corpora for specific PV themes without manual chart review, speeding up studies on health disparities and communication effectiveness.
- Compliance & Documentation – Automated coding of patient‑generated content supports documentation requirements for value‑based care models that reward patient‑centered outcomes.
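As a sketch of the first implication, the structured labels could feed a simple alerting filter. The label names and threshold below are illustrative assumptions, not taken from the paper's taxonomy:

```python
# Hypothetical downstream use: flag messages whose predicted label
# probabilities cross a threshold for SDoH categories. Label names
# and the 0.5 cutoff are illustrative, not from the paper.
SDOH_LABELS = {"housing_insecurity", "transportation_barrier",
               "medication_affordability"}
THRESHOLD = 0.5

def sdoh_alerts(predictions):
    """predictions: list of (message_id, {label: probability}) pairs.
    Returns (message_id, label, probability) triples worth surfacing."""
    alerts = []
    for msg_id, probs in predictions:
        for label, p in probs.items():
            if label in SDOH_LABELS and p >= THRESHOLD:
                alerts.append((msg_id, label, p))
    return alerts

preds = [
    ("msg-1", {"housing_insecurity": 0.91, "gratitude": 0.88}),
    ("msg-2", {"transportation_barrier": 0.32}),
]
print(sdoh_alerts(preds))  # only msg-1's housing label crosses the threshold
```

In practice the threshold would be tuned per label against the precision/recall trade-off care teams can tolerate for outreach.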
Limitations & Future Work
- Domain Generalization – The models were trained on a single health system’s secure messaging platform; performance on other institutions, languages, or communication channels (e.g., SMS, patient portals) remains untested.
- Annotation Granularity – The hierarchical code set reflects the authors’ expert taxonomy; extending or adapting it to other clinical contexts may require additional labeling effort.
- Explainability – While the model outputs label probabilities, deeper interpretability (e.g., highlighting text spans that drove a specific SDoH tag) is not yet built into the pipeline.
- Future Directions – The authors plan to (1) evaluate cross‑institution transfer learning, (2) incorporate multimodal data (e.g., audio interviews), and (3) develop user‑facing tools that surface highlighted excerpts for clinicians to review.
Authors
- Samah Fodeh
- Linhai Ma
- Yan Wang
- Srivani Talakokkul
- Ganesh Puthiaraju
- Afshan Khan
- Ashley Hagaman
- Sarah Lowe
- Aimee Roundtree
Paper Information
- arXiv ID: 2602.21165v1
- Categories: cs.CL, cs.AI
- Published: February 24, 2026