[Paper] MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events

Published: April 16, 2026 at 12:28 PM EDT

Source: arXiv - 2604.15203v1

Overview

The MADE benchmark tackles a core problem for AI in healthcare: automatically classifying free‑text medical device adverse event (MDAE) reports while also estimating how confident the model is in each prediction. By continuously ingesting newly published reports, MADE stays “alive,” which guards against the test‑set contamination that has undermined older, static text‑classification benchmarks. The paper releases a challenging multi‑label dataset and systematically compares more than 20 language models alongside three families of uncertainty‑quantification (UQ) techniques.

Key Contributions

A living, temporally‑split benchmark built from real‑world MDAE reports, featuring a long‑tailed hierarchy of > 1 000 labels.
Strict temporal train/validation/test splits that eliminate accidental test‑set contamination and mimic real deployment scenarios.
Comprehensive baseline suite: > 20 encoder‑only and decoder‑only models evaluated under full fine‑tuning, few‑shot, and instruction‑tuned (reasoning) regimes.
Systematic UQ evaluation: entropy‑based, consistency‑based, and self‑verbalized confidence methods are benchmarked side‑by‑side.
Empirical insights into the trade‑offs between label coverage (head vs. tail), model size, fine‑tuning style, and the reliability of uncertainty estimates.
Open‑source release of data, code, and a web demo (https://hhi.fraunhofer.de/aml-demonstrator/made-benchmark) for reproducibility and community extensions.
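
As a rough illustration of the strict temporal splits listed above, the following Python sketch partitions reports by date. The cutoff dates, the `temporal_split` helper, and the report dictionaries are assumptions made for illustration; they are not taken from the MADE code release.

```python
from datetime import date

# Hypothetical cutoffs (assumed for illustration only).
TRAIN_END = date(2023, 1, 1)  # reports dated before this go to train
VALID_END = date(2023, 7, 1)  # reports dated before this go to validation


def temporal_split(reports):
    """Partition reports into train/validation/test by report date."""
    splits = {"train": [], "validation": [], "test": []}
    for report in reports:
        if report["date"] < TRAIN_END:
            splits["train"].append(report)
        elif report["date"] < VALID_END:
            splits["validation"].append(report)
        else:
            splits["test"].append(report)
    return splits


# Toy reports with made-up identifiers and dates.
reports = [
    {"id": "r1", "date": date(2022, 11, 3)},
    {"id": "r2", "date": date(2023, 2, 14)},
    {"id": "r3", "date": date(2023, 9, 30)},
]
splits = temporal_split(reports)
print({name: [r["id"] for r in rows] for name, rows in splits.items()})
# prints {'train': ['r1'], 'validation': ['r2'], 'test': ['r3']}
```

Because every training report predates every test report, a model evaluated this way cannot have seen the test period during training.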

Methodology

Data collection & curation – The authors scrape FDA‑MAUDE adverse event reports, extract the free‑text narrative, and map each report to a set of hierarchical MedDRA (Medical Dictionary for Regulatory Activities) codes. The label distribution follows a classic long‑tail: a few common device‑issue combos (the “head”) and thousands of rare ones (the “tail”).
Living benchmark pipeline – A scheduled crawler adds new reports every month and automatically re‑splits the data at temporal cutoffs (e.g., all reports before Jan 2023 for training, Jan 2023–Jun 2023 for validation, after Jun 2023 for test). Because the training data always predates the test data, models never see future information during training.
Model families –
- Encoder‑only (BERT, RoBERTa, DeBERTa, etc.) fine‑tuned with a sigmoid‑cross‑entropy head for multi‑label output.
- Decoder‑only (GPT‑2/3, LLaMA, Falcon) fine‑tuned to generate a comma‑separated list of labels.
- Instruction‑tuned variants (e.g., Flan‑T5, Claude) evaluated in few‑shot mode with prompts that ask the model to “list all applicable adverse event codes.”
Uncertainty quantification – Three families are compared:
- Entropy of the sigmoid output distribution (higher entropy → higher uncertainty).
- Consistency across multiple stochastic forward passes (Monte‑Carlo dropout or ensemble voting).
- Self‑verbalized confidence where the model is asked to output a confidence phrase (“I am 90 % sure”).
Metrics – Standard multi‑label scores (micro‑F1, macro‑F1, label‑wise AUC) plus UQ calibration (expected calibration error, reliability diagrams) and coverage‑accuracy curves (how accuracy changes when we only keep predictions below a given uncertainty threshold).
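
To make the first two uncertainty families concrete, here is a minimal sketch, not the authors' implementation. It assumes each label gets an independent sigmoid probability, and that consistency is measured as disagreement among binary votes from several stochastic passes; `binary_entropy` and `vote_disagreement` are hypothetical helper names.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_entropy(p, eps=1e-12):
    """Entropy in bits of a Bernoulli(p) prediction; maximal at p = 0.5."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def vote_disagreement(votes):
    """Share of passes that disagree with the majority (0.0 to 0.5).

    `votes` holds one 0/1 decision per stochastic forward pass
    (e.g. MC dropout samples or ensemble members) for a single label.
    """
    positive_share = sum(votes) / len(votes)
    return min(positive_share, 1.0 - positive_share)


# Toy logits for three hypothetical labels of one report.
for logit in (4.0, 0.2, -3.0):
    p = sigmoid(logit)
    print(f"p={p:.3f}  entropy={binary_entropy(p):.3f} bits")

print(vote_disagreement([1, 1, 1, 0, 1]))  # one of five passes disagrees
```

Both scores are near zero for confident labels and grow as the prediction approaches a coin flip, so either can be thresholded to decide which predictions to trust.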

Results & Findings

Setting	Head labels (micro‑F1)	Tail labels (macro‑F1)	Best UQ calibration (ECE, lower is better)
Small discriminative decoder (e.g., GPT‑2‑small, fine‑tuned)	0.78	0.42	0.12
Large discriminative encoder (e.g., DeBERTa‑xxlarge)	0.74	0.38	0.09
Generative fine‑tuned decoder (e.g., LLaMA‑7B)	0.71	0.45	0.07
Instruction‑tuned reasoning model (few‑shot)	0.68	0.51	0.15
Self‑verbalized confidence	–	–	0.20 (worst)
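
For readers unfamiliar with ECE, the sketch below shows one common binned formulation: group predictions by confidence, then average the gap between mean confidence and accuracy in each bin, weighted by bin size. The bin count and the function name are assumptions; the paper's exact ECE variant may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - confidence| over bins.

    confidences: floats in [0, 1], one per prediction
    correct:     booleans, True where the prediction was right
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - mean_conf)
    return ece


# Toy case: the model claims 90 % confidence but is right only half the time.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(round(overconfident, 3))  # prints 0.4
```

A perfectly calibrated model has an ECE of 0; larger values mean stated confidence and observed accuracy diverge.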

Takeaways

Discriminatively fine‑tuned decoders score highest on frequent (head) labels (micro‑F1 0.78) while still offering reasonable uncertainty estimates (ECE 0.12).
Generative fine‑tuning (training the model to output label lists) yields the best‑calibrated uncertainties (ECE 0.07), at the cost of lower head‑label performance (micro‑F1 0.71).
Reasoning‑oriented few‑shot models do best on rare (tail) labels (macro‑F1 0.51) but produce over‑confident predictions, which hurts calibration (ECE 0.15).
Self‑verbalized confidence is the least reliable signal (ECE 0.20): the model’s natural‑language confidence statements are poor proxies for actual uncertainty.
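
The head/tail distinction in these takeaways rests on how micro‑ and macro‑averaged F1 weight labels. The sketch below uses made‑up per‑label counts (the label names and numbers are purely illustrative) to show why micro‑F1 is dominated by frequent labels while macro‑F1 exposes weak tail performance.

```python
def f1(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts maps label -> (tp, fp, fn); returns (micro_f1, macro_f1)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)                                      # pools all decisions
    macro = sum(f1(*c) for c in counts.values()) / len(counts)  # one vote per label
    return micro, macro


# Hypothetical counts: one frequent label, two rare ones.
counts = {
    "head_label": (90, 10, 10),
    "tail_label_a": (1, 1, 3),
    "tail_label_b": (0, 0, 4),
}
micro, macro = micro_macro_f1(counts)
print(f"micro-F1={micro:.3f}  macro-F1={macro:.3f}")
# prints micro-F1=0.867  macro-F1=0.411
```

In this toy case the single well‑predicted head label keeps micro‑F1 high, while the two poorly predicted tail labels pull macro‑F1 down.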

Practical Implications

Regulatory monitoring pipelines can plug a MADE‑trained decoder model into their ingestion workflow, automatically tagging new adverse event reports and flagging high‑uncertainty cases for human review.
Selective prediction and active learning loops become feasible: the coverage‑accuracy curves show that discarding predictions above a certain uncertainty threshold retains > 85 % of head‑label accuracy while dramatically reducing false positives on rare events.
Model selection guidance – If your product needs strong performance on rare device failures (e.g., early‑warning safety systems), a large reasoning model in few‑shot mode may be worth the extra calibration work. For steady triage of incoming reports, a fine‑tuned decoder is a sweet spot: a small discriminative one (e.g., GPT‑2‑small) for the best head‑label scores, or a generatively fine‑tuned one (e.g., LLaMA‑7B) when calibration matters most.
Continuous benchmarking – Because MADE updates automatically, organizations can track model drift over time and re‑train only when performance on the newest batch degrades, saving compute and annotation costs.
Open‑source tooling – The released evaluation scripts integrate with Hugging Face 🤗 Transformers, making it trivial for dev teams to benchmark their own proprietary models against the baseline suite.
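
The triage idea running through this section can be sketched in a few lines. This is an assumed workflow, not code from the MADE release: each prediction carries an `uncertainty` score and, on held‑out data, a `correct` flag; `triage` and `coverage_and_accuracy` are hypothetical helpers, and the threshold would need tuning on validation data.

```python
def triage(predictions, threshold):
    """Split predictions into auto-accepted and human-review queues."""
    accepted = [p for p in predictions if p["uncertainty"] <= threshold]
    review = [p for p in predictions if p["uncertainty"] > threshold]
    return accepted, review


def coverage_and_accuracy(predictions, threshold):
    """One point on a coverage-accuracy curve for a given threshold."""
    accepted, _ = triage(predictions, threshold)
    coverage = len(accepted) / len(predictions) if predictions else 0.0
    accuracy = (
        sum(1 for p in accepted if p["correct"]) / len(accepted)
        if accepted
        else None  # undefined when nothing is accepted
    )
    return coverage, accuracy


# Toy held-out predictions with made-up uncertainty scores.
held_out = [
    {"uncertainty": 0.1, "correct": True},
    {"uncertainty": 0.2, "correct": True},
    {"uncertainty": 0.6, "correct": False},
    {"uncertainty": 0.9, "correct": True},
]
for threshold in (0.5, 1.0):
    print(threshold, coverage_and_accuracy(held_out, threshold))
# prints 0.5 (0.5, 1.0)
# prints 1.0 (1.0, 0.75)
```

Sweeping the threshold traces the trade‑off between how many reports are tagged automatically and how accurate those automatic tags are; everything above the threshold goes to a human reviewer.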

Limitations & Future Work

Domain specificity – MADE focuses on FDA device reports; transferability to other medical text domains (e.g., clinical notes, pharmacovigilance) is not yet validated.
Label hierarchy depth – While hierarchical MedDRA codes are provided, the current baselines treat them as flat multi‑labels; exploiting the hierarchy (e.g., hierarchical loss) could boost tail performance.
Scalability of few‑shot prompting – Large reasoning models require expensive API calls; future work could explore lightweight adapters or LoRA fine‑tuning to retain reasoning benefits without prohibitive cost.
Uncertainty methods – Only entropy, consistency, and self‑verbalized confidence were examined; Bayesian neural nets, deep ensembles, or test‑time augmentation remain open avenues.
Human‑in‑the‑loop studies – The paper stops at quantitative calibration; user studies measuring how clinicians interact with uncertainty scores would solidify real‑world impact.

Bottom line: MADE offers a realistic, continuously refreshed playground for anyone building AI that reads medical device safety reports. Its thorough evaluation of both performance and uncertainty equips developers with concrete guidance on which model families to adopt, how to handle rare events, and where to focus future research. Happy hacking!

Authors

Raunak Agarwal
Markus Wenzel
Simon Baur
Jonas Zimmer
George Harvey
Jackie Ma

Paper Information

arXiv ID: 2604.15203v1
Categories: cs.CL
Published: April 16, 2026
