[Paper] PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data
Source: arXiv - 2602.21165v1
Overview
The paper presents PVminer, a domain‑specific natural‑language‑processing (NLP) toolkit that automatically extracts the “patient voice” (PV) from large collections of patient‑generated text such as secure messages, surveys, and interview transcripts. By turning unstructured patient communication into structured, machine‑readable labels, PVminer makes it feasible for health systems to scale qualitative insights that were previously limited to labor‑intensive manual coding.
Key Contributions
- Domain‑adapted BERT models (PV‑BERT‑base & PV‑BERT‑large) fine‑tuned on patient‑authored language, outperforming generic biomedical and clinical BERT variants.
- Multi‑label, hierarchical classification that predicts three label levels (Code, Subcode, Combo) in a single pipeline.
- Topic‑augmented representation (PV‑Topic‑BERT) that injects unsupervised topic vectors into the encoder, enriching semantic context.
- Comprehensive benchmark showing F1 scores of 82.25 % (Code), 80.14 % (Subcode), and 77.87 % (Combo) against strong baselines.
- Open‑source release of models, training scripts, and documentation, plus a request‑based annotated dataset for research reuse.
Methodology
- Data Curation – Secure patient‑provider messages were manually annotated with a hierarchical coding scheme that captures both patient‑centered communication (PCC) categories and social determinants of health (SDoH).
- Domain Adaptation – Two BERT models were further pre‑trained on the patient‑generated corpus, creating PV‑BERT‑base (12 layers) and PV‑BERT‑large (24 layers). This step teaches the model the idiosyncrasies of patient language (e.g., colloquialisms, misspellings, shorthand).
- Topic Modeling – An unsupervised LDA‑style model extracts latent topics from the same corpus. The resulting topic distribution vectors are concatenated with the BERT token embeddings, forming the PV‑Topic‑BERT input.
- Multi‑Task Fine‑Tuning – A shared encoder feeds three classification heads (Code, Subcode, Combo). The heads are trained jointly using a binary cross‑entropy loss for each label, allowing the model to learn inter‑label dependencies.
- Inference Augmentation – During prediction, the model also incorporates the author’s identity (patient vs. provider) as a binary feature, which the authors found improves discrimination between patient‑expressed concerns and provider‑generated content.
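The pipeline above can be sketched in plain NumPy. This is a minimal stand-in, not the paper's implementation: the dimensions, label-set sizes, and random "encoder" vector are hypothetical, with a dummy vector in place of a real BERT encoding and a Dirichlet draw in place of a real LDA topic distribution. It only illustrates the concatenation of encoder output, topic vector, and author-identity flag, followed by three sigmoid heads trained with per-label binary cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 768-d encoder output (BERT-base width), 50 topics,
# 1 binary author-identity feature, and illustrative label-set sizes.
ENC_DIM, N_TOPICS, N_CODE, N_SUBCODE, N_COMBO = 768, 50, 10, 30, 60

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(enc_vec, topic_vec, is_patient, heads):
    """Concatenate encoder output, topic distribution, and author flag,
    then score every label at each level with its own sigmoid head."""
    x = np.concatenate([enc_vec, topic_vec, [float(is_patient)]])
    return {name: sigmoid(x @ W) for name, W in heads.items()}

def bce(probs, targets):
    """Binary cross-entropy averaged over labels -- one term per label,
    mirroring the joint multi-label objective described above."""
    p = np.clip(probs, 1e-7, 1 - 1e-7)
    return float(-np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p)))

in_dim = ENC_DIM + N_TOPICS + 1
heads = {
    "code": rng.normal(0, 0.01, (in_dim, N_CODE)),
    "subcode": rng.normal(0, 0.01, (in_dim, N_SUBCODE)),
    "combo": rng.normal(0, 0.01, (in_dim, N_COMBO)),
}

enc = rng.normal(size=ENC_DIM)             # stand-in for a BERT [CLS] vector
topics = rng.dirichlet(np.ones(N_TOPICS))  # stand-in for an LDA topic mix
probs = forward(enc, topics, is_patient=True, heads=heads)
print({name: p.shape for name, p in probs.items()})
```

In training, the three heads' losses would be summed and backpropagated through the shared encoder, which is how the joint setup lets the levels inform one another.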
Results & Findings
| Task | PVminer F1 | BioBERT baseline F1 |
|---|---|---|
| Code (top‑level) | 82.25 % | 74.3 % |
| Subcode (mid‑level) | 80.14 % | 71.9 % |
| Combo (fine‑grained) | 77.87 % | 68.5 % |
- Ablation Study: Removing the author‑identity feature drops Code F1 by ~2 percentage points; removing topic augmentation drops Subcode F1 by ~3 percentage points, confirming both components add measurable value.
- Scalability: The end‑to‑end pipeline processes thousands of messages per hour on a single GPU, making it practical for health‑system‑wide deployment.
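The summary above reports one F1 per label level without specifying the averaging scheme; the sketch below assumes micro-averaging, the common choice for multi-label tasks, where true/false positives and negatives are pooled across all labels before combining precision and recall.

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for binary multi-label arrays of shape
    (n_samples, n_labels): pool TP/FP/FN over every label, then combine."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Toy check: 3 messages x 4 labels (one miss, one false alarm).
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 1],
                   [1, 1, 0, 1]])
print(round(micro_f1(y_true, y_pred), 4))  # -> 0.8333 (TP=5, FP=1, FN=1)
```

Macro-averaging (a per-label F1, then a plain mean) would weight rare Subcode and Combo labels more heavily, so the two schemes can diverge noticeably on skewed label sets like these.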
Practical Implications
- Automated SDoH Extraction – Clinicians and care managers can receive real‑time alerts about housing insecurity, transportation barriers, or medication affordability directly from patient messages, enabling proactive outreach.
- Quality‑Improvement Dashboards – Structured PV data can be visualized in population‑health dashboards, helping health systems track patient‑centered communication metrics across clinics.
- Clinical Decision Support – Integration with EHRs could surface patient‑voice tags alongside clinical notes, giving providers richer context for shared decision‑making.
- Research Acceleration – Researchers can query large corpora for specific PV themes without manual chart review, speeding up studies on health disparities and communication effectiveness.
- Compliance & Documentation – Automated coding of patient‑generated content supports documentation requirements for value‑based care models that reward patient‑centered outcomes.
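As a sketch of the first implication, the structured labels could feed a simple alerting filter. The label names and threshold below are illustrative assumptions, not taken from the paper's taxonomy:

```python
# Hypothetical downstream use: flag messages whose predicted label
# probabilities cross a threshold for SDoH categories. Label names
# and the 0.5 cutoff are illustrative, not from the paper.
SDOH_LABELS = {"housing_insecurity", "transportation_barrier",
               "medication_affordability"}
THRESHOLD = 0.5

def sdoh_alerts(predictions):
    """predictions: list of (message_id, {label: probability}) pairs.
    Returns (message_id, label, probability) triples worth surfacing."""
    alerts = []
    for msg_id, probs in predictions:
        for label, p in probs.items():
            if label in SDOH_LABELS and p >= THRESHOLD:
                alerts.append((msg_id, label, p))
    return alerts

preds = [
    ("msg-1", {"housing_insecurity": 0.91, "gratitude": 0.88}),
    ("msg-2", {"transportation_barrier": 0.32}),
]
print(sdoh_alerts(preds))  # only msg-1's housing label crosses the threshold
```

In practice the threshold would be tuned per label against the precision/recall trade-off care teams can tolerate for outreach.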
Limitations & Future Work
- Domain Generalization – The models were trained on a single health system’s secure messaging platform; performance on other institutions, languages, or communication channels (e.g., SMS, patient portals) remains untested.
- Annotation Granularity – The hierarchical code set reflects the authors’ expert taxonomy; extending or adapting it to other clinical contexts may require additional labeling effort.
- Explainability – While the model outputs label probabilities, deeper interpretability (e.g., highlighting text spans that drove a specific SDoH tag) is not yet built into the pipeline.
- Future Directions – The authors plan to (1) evaluate cross‑institution transfer learning, (2) incorporate multimodal data (e.g., audio interviews), and (3) develop user‑facing tools that surface highlighted excerpts for clinicians to review.
Authors
- Samah Fodeh
- Linhai Ma
- Yan Wang
- Srivani Talakokkul
- Ganesh Puthiaraju
- Afshan Khan
- Ashley Hagaman
- Sarah Lowe
- Aimee Roundtree
Paper Information
- arXiv ID: 2602.21165v1
- Categories: cs.CL, cs.AI
- Published: February 24, 2026