[Paper] A Dataset and Benchmark for Consumer Healthcare Question Summarization
Source: arXiv - 2512.23637v1
Overview
The paper introduces CHQ‑Sum, a newly curated dataset of 1,507 consumer health questions paired with concise, expert‑written summaries. By providing a high‑quality benchmark annotated by domain experts, the authors aim to accelerate research on automatically summarizing noisy, user‑generated health queries, a task essential for building smarter health assistants, search engines, and triage bots.
Key Contributions
- CHQ‑Sum dataset: 1,507 real‑world consumer health questions from a community Q&A forum, each annotated with a succinct, medically accurate summary by domain experts.
- Comprehensive benchmark: Evaluation of several state‑of‑the‑art abstractive summarization models (e.g., BART, T5, PEGASUS) on the new dataset, establishing baseline performance numbers.
- Analysis of domain challenges: Detailed error analysis highlighting why consumer health questions are harder to summarize than generic text (e.g., jargon, irrelevant details, ambiguous phrasing).
- Open‑source release: The dataset, preprocessing scripts, and evaluation code are publicly released, encouraging reproducibility and further research.
Methodology
- Data collection – The authors scraped consumer health questions from a popular community question‑answering platform, filtering for posts that contain a clear medical intent.
- Expert annotation – Trained medical professionals rewrote each question into a short, information‑dense summary (≈30‑40 words) that captures the core health concern while discarding extraneous storytelling.
- Pre‑processing – Text was normalized (tokenization, de‑identification) and partitioned 80/10/10 into train, validation, and test splits.
- Model benchmarking – Four transformer‑based abstractive summarizers (BART‑large, T5‑base, PEGASUS‑large, and a Longformer Encoder‑Decoder) were fine‑tuned on the training split. Standard automatic metrics (ROUGE‑1/2/L, BERTScore) were used for evaluation, complemented by human assessments of medical correctness.
The pipeline is deliberately simple so that developers can reproduce the results with a single GPU and adapt the code to other health‑related summarization tasks.
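To make that concrete, here is a minimal fine‑tuning sketch using Hugging Face Transformers. It assumes the released dataset has been exported to CSV files with `question` and `summary` columns; the file names, column names, and hyperparameters are illustrative stand‑ins, not the authors' exact configuration.

```python
# Minimal fine-tuning sketch for CHQ-Sum-style data (file and column names
# are hypothetical; adapt them to the released data format).
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "facebook/bart-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical file layout for the 80/10/10 split described above.
data = load_dataset(
    "csv",
    data_files={"train": "chq_sum_train.csv", "validation": "chq_sum_val.csv"},
)

def preprocess(batch):
    # The noisy consumer question is the source; the expert summary is the target.
    inputs = tokenizer(batch["question"], max_length=512, truncation=True)
    targets = tokenizer(text_target=batch["summary"], max_length=64, truncation=True)
    inputs["labels"] = targets["input_ids"]
    return inputs

tokenized = data.map(preprocess, batched=True, remove_columns=data["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="chq-sum-bart",
    per_device_train_batch_size=4,   # fits a single consumer GPU
    num_train_epochs=3,
    learning_rate=3e-5,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

Swapping in `google/pegasus-large` or `t5-base` as `model_name` reproduces the other baselines under the same loop.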
Results & Findings
| Model | ROUGE‑1 | ROUGE‑2 | ROUGE‑L | BERTScore |
|---|---|---|---|---|
| BART‑large | 38.2 | 15.7 | 35.9 | 0.84 |
| T5‑base | 36.5 | 14.9 | 34.1 | 0.82 |
| PEGASUS‑large | 40.1 | 16.4 | 37.2 | 0.86 |
| Longformer‑LED | 37.8 | 15.2 | 35.5 | 0.83 |
- PEGASUS‑large achieved the best ROUGE scores, confirming that models pre‑trained on large summarization corpora transfer well to the health domain.
- Human evaluation revealed that while the models often produce fluent summaries, medical accuracy remains a bottleneck: roughly 30% of generated summaries omitted or misrepresented a key symptom or condition.
- Error analysis showed that the models struggle most with overly verbose questions and with lay phrasing that implies a medical concept (e.g., “feeling off” → “dysphoria”).
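For reference, scores like those in the table can be computed with the Hugging Face `evaluate` library. The sketch below uses a single illustrative prediction/reference pair rather than real model outputs.

```python
# Sketch of the ROUGE / BERTScore evaluation reported above (example
# strings are invented, not drawn from the dataset).
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["Asks whether intermittent chest pain after exercise needs urgent care."]
references = ["What does intermittent chest pain after exercise indicate, and is it urgent?"]

rouge_scores = rouge.compute(predictions=predictions, references=references)
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="en")

print({k: round(v, 4) for k, v in rouge_scores.items()})  # rouge1, rouge2, rougeL, rougeLsum
print("BERTScore F1:", round(bert_scores["f1"][0], 4))
```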
Practical Implications
- Improved health chatbots: Integrating a fine‑tuned summarizer can condense user‑provided symptom narratives into concise, structured inputs for downstream diagnosis or triage modules.
- Search & retrieval: Summarized queries enable more precise indexing and ranking in consumer health search engines, reducing noise from storytelling.
- Clinical decision support: Summaries can be automatically attached to patient‑generated health data (e.g., portal messages), helping clinicians quickly grasp the core issue.
- Regulatory compliance: By stripping personally identifiable details while preserving medical intent, summarization can aid in anonymizing data for research or AI model training.
Developers can start by fine‑tuning PEGASUS or BART on CHQ‑Sum, then plug the model into existing pipelines (e.g., using Hugging Face Transformers) with minimal engineering overhead.
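A hedged sketch of that last step, assuming a locally fine‑tuned checkpoint (the model directory name is hypothetical):

```python
# Summarize a noisy consumer health question with a fine-tuned checkpoint.
from transformers import pipeline

# "chq-sum-bart" is a placeholder for your own fine-tuned model directory.
summarizer = pipeline("summarization", model="chq-sum-bart")

question = (
    "Hi everyone, I'm 34 and for the last two weeks I've been feeling off, "
    "dizzy when I stand up, and my doctor is on vacation... should I worry "
    "about my blood pressure medication?"
)
print(summarizer(question, max_length=48, min_length=10)[0]["summary_text"])
```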
Limitations & Future Work
- Dataset size: Although high‑quality, the 1,507 examples are modest compared to generic summarization corpora, limiting the ability to train very large models from scratch.
- Domain scope: The questions are sourced from a single community forum, which may not capture the full linguistic diversity of global consumer health queries (e.g., non‑English, low‑literacy users).
- Medical correctness: Current models still make factual errors; future work should explore fact‑checking or knowledge‑grounded generation using medical ontologies (e.g., UMLS).
- Multi‑turn context: Many health inquiries involve follow‑up questions; extending the benchmark to multi‑turn dialogues is a promising direction.
By addressing these gaps, the community can move toward robust, trustworthy summarization tools that truly empower both developers and end‑users in the consumer health space.
Authors
- Abhishek Basu
- Deepak Gupta
- Dina Demner‑Fushman
- Shweta Yadav
Paper Information
- arXiv ID: 2512.23637v1
- Categories: cs.CL
- Published: December 29, 2025