[Paper] Automated Generation of Custom MedDRA Queries Using SafeTerm Medical Map

Published: December 8, 2025 at 11:33 AM EST

Source: arXiv - 2512.07694v1

Overview

The paper introduces SafeTerm, an AI‑driven system that automatically builds custom MedDRA queries—standardized lists of adverse‑event terms used by regulators and pharma companies during drug safety reviews. By embedding medical terminology in a high‑dimensional vector space and ranking candidate terms with statistical similarity scores, SafeTerm can retrieve relevant MedDRA Preferred Terms (PTs) with minimal human effort, offering a fast, reproducible alternative to the labor‑intensive manual process.

Key Contributions

End‑to‑end AI pipeline that converts a free‑text safety query into a ranked list of MedDRA PTs.
Vector‑space representation of both query terms and MedDRA PTs, enabling cosine‑similarity based matching.
Extreme‑value clustering to group highly similar PTs and avoid redundancy in the generated list.
Multi‑criteria statistical scoring that balances precision and recall across a tunable similarity threshold.
Comprehensive validation against the FDA’s Office of New Drugs Custom Medical Queries (OCMQ) v3.0 (104 curated queries), reporting precision, recall, and F1 across thresholds.
Practical recommendation of a default similarity threshold (~0.60) for initial runs, with higher thresholds for tighter term selection.

Methodology

Data Preparation – The authors extracted all valid MedDRA Preferred Terms (≈ 23 k PTs) and the 104 FDA OCMQs, each consisting of a set of PTs curated by experts.
Embedding Generation – Using a pre‑trained biomedical language model (e.g., BioBERT or similar), each term (both query words and PTs) is transformed into a dense vector. The vectors capture semantic relationships such as synonymy and hierarchical medical concepts.
Similarity Computation – For a given input query, the system computes the cosine similarity between the query’s embedding and every PT embedding.
Extreme‑Value Clustering – PTs that are extremely close to each other (high similarity) are clustered together; only the most representative PT from each cluster is kept, reducing noise and duplication.
Scoring & Ranking – Each PT receives a relevance score based on its similarity value and cluster statistics. The PTs are then sorted from most to least relevant.
Threshold Tuning – Varying the similarity cut‑off (e.g., 0.60, 0.70, 0.75) trades recall against precision: a lower cut‑off captures more true PTs, while a higher one reduces false positives.
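
The similarity and threshold steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the three-dimensional vectors are made-up toy values standing in for real model embeddings, and the names `cosine_similarity`, `rank_terms`, and `TOY_PT_VECTORS` are chosen here for the example only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_terms(query_vec, pt_vecs, threshold=0.60):
    """Score every term against the query, drop those below the
    threshold, and return (term, score) pairs sorted best-first."""
    scored = [(term, cosine_similarity(query_vec, vec))
              for term, vec in pt_vecs.items()]
    kept = [(term, score) for term, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumption): real embeddings would come from a language
# model and have hundreds of dimensions.
TOY_PT_VECTORS = {
    "Hepatic failure": [0.9, 0.1, 0.0],
    "Liver injury": [0.8, 0.3, 0.1],
    "Headache": [0.0, 0.2, 0.9],
}
toy_query_vector = [1.0, 0.2, 0.0]  # stands in for an embedded liver-related query

ranked = rank_terms(toy_query_vector, TOY_PT_VECTORS, threshold=0.60)
for term, score in ranked:
    print(f"{term}: {score:.3f}")
# "Headache" scores far below 0.60 here and is filtered out.
```

Raising `threshold` shortens the returned list, which is the precision/recall lever described in the last step.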

The entire pipeline runs automatically, requiring only the textual description of the safety signal as input.
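
The redundancy-reduction idea behind the clustering step can be illustrated with a much simpler stand-in. The sketch below is not extreme-value clustering, whose details are not given in this summary; it is a plain greedy filter that keeps a term only if it is not a near-duplicate of a term already kept. The function name, the `dedup_threshold` value of 0.95, and the toy vectors are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suppress_near_duplicates(ranked_terms, vectors, dedup_threshold=0.95):
    """Walk the list best-first and keep a term only when its similarity
    to every previously kept term is below dedup_threshold."""
    kept = []
    for term in ranked_terms:
        if all(cosine(vectors[term], vectors[k]) < dedup_threshold for k in kept):
            kept.append(term)
    return kept


# Toy data (assumption): placeholder labels, not real MedDRA terms.
TOY_VECTORS = {
    "Term A": [1.0, 0.0],
    "Term A variant": [0.99, 0.05],  # almost identical direction to "Term A"
    "Term B": [0.0, 1.0],
}

representatives = suppress_near_duplicates(
    ["Term A", "Term A variant", "Term B"], TOY_VECTORS
)
print(representatives)
```

Because the input is processed in ranked order, the highest-scoring member of each group of near-duplicates is the one that survives.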

Results & Findings

Similarity Threshold	Recall	Precision	F1
0.60 (recommended start)	> 95 %	~ 30 %	—
0.70 – 0.75 (optimal balance)	~ 50 %	~ 33 %	~ 40 %
> 0.80 (high‑precision mode)	< 30 %	up to 86 %	—
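
As a quick consistency check on the middle row, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short snippet below plugs in the approximate values from the table; `f1_score` is a helper defined here, not something taken from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Approximate values from the 0.70 - 0.75 row of the table.
f1 = f1_score(precision=0.33, recall=0.50)
print(f"F1 = {f1:.2f}")  # about 0.40, in line with the reported ~ 40 %
```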

High recall at low thresholds shows SafeTerm can retrieve almost all PTs that a human expert would include, making it a reliable safety net.
Precision improves sharply with higher thresholds, allowing developers to generate concise, high‑confidence term lists when needed.
Narrow‑term subsets (queries focusing on a small medical concept) behaved similarly to full queries but required slightly higher thresholds to maintain precision.

Overall, the system demonstrates that a single similarity cut‑off around 0.60 provides a solid baseline, while fine‑tuning the threshold tailors the output to specific project needs.
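
The kind of threshold sweep behind these numbers can be sketched as follows. This is an assumed evaluation loop, not the authors' code: the scores and the reference set are invented toy values, and `precision_recall` and `sweep` are names introduced here.

```python
def precision_recall(retrieved, reference):
    """Precision and recall of a retrieved set against a reference set."""
    retrieved, reference = set(retrieved), set(reference)
    true_positives = len(retrieved & reference)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


def sweep(scores, reference, thresholds):
    """Compute (precision, recall) at each similarity threshold."""
    results = {}
    for threshold in thresholds:
        retrieved = {term for term, score in scores.items() if score >= threshold}
        results[threshold] = precision_recall(retrieved, reference)
    return results


# Toy data (assumption): similarity scores for five candidate terms and
# an expert-curated reference list containing three of them.
TOY_SCORES = {"pt_1": 0.91, "pt_2": 0.78, "pt_3": 0.66, "pt_4": 0.62, "pt_5": 0.40}
TOY_REFERENCE = {"pt_1", "pt_2", "pt_4"}

results = sweep(TOY_SCORES, TOY_REFERENCE, [0.60, 0.70, 0.80])
for threshold, (precision, recall) in results.items():
    print(f"threshold {threshold:.2f}: precision {precision:.2f}, recall {recall:.2f}")
```

In this toy case, recall falls from 1.00 to 0.33 as the threshold rises from 0.60 to 0.80 while precision climbs from 0.75 to 1.00, the same direction of trade-off the table reports.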

Practical Implications

Accelerated safety signal detection – Pharmacovigilance teams can generate draft MedDRA queries in seconds rather than days, freeing analysts to focus on interpretation rather than term hunting.
Consistent, reproducible query construction – The vector‑based approach eliminates variability caused by differing expert vocabularies, supporting regulatory audits and cross‑team collaboration.
Integration into existing pipelines – SafeTerm can be wrapped as a micro‑service (REST API) and called from data‑ingestion workflows, EHR‑based adverse‑event monitoring tools, or post‑market surveillance dashboards.
Rapid prototyping for new therapeutic areas – When a novel drug class emerges, SafeTerm can quickly suggest relevant PTs even before a domain expert has curated a full query.
Cost reduction – Automating the bulk of query generation reduces the hours that highly trained medical coders spend on it, translating into measurable savings for pharma and CROs.
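
To make the micro-service idea concrete, here is a minimal sketch using only Python's standard library. Nothing in it comes from the paper: the JSON fields (`query`, `threshold`, `terms`), the handler and function names, and the port are hypothetical, and `rank_terms_placeholder` is an empty stub where a real ranking backend would be plugged in.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def rank_terms_placeholder(query_text, threshold):
    """Hypothetical stand-in for a real ranking backend; returns no terms."""
    return []


def build_response(payload, ranker=rank_terms_placeholder):
    """Validate a decoded JSON payload and return (status_code, body)."""
    if not isinstance(payload, dict) or not payload.get("query"):
        return 400, {"error": "expected a JSON object with a non-empty 'query'"}
    try:
        threshold = float(payload.get("threshold", 0.60))
    except (TypeError, ValueError):
        return 400, {"error": "'threshold' must be a number"}
    terms = ranker(payload["query"], threshold)
    return 200, {"query": payload["query"], "threshold": threshold, "terms": terms}


class QueryHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except ValueError:  # malformed JSON or undecodable bytes
            status, body = 400, {"error": "invalid JSON"}
        else:
            status, body = build_response(payload)
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def make_server(host="127.0.0.1", port=8000):
    """Create (but do not start) an HTTP server bound to host:port."""
    return HTTPServer((host, port), QueryHandler)
```

Calling `make_server().serve_forever()` would start the service locally. Keeping validation in `build_response`, separate from the HTTP handler, lets the request logic be tested without opening a socket.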

Limitations & Future Work

Precision ceiling – Even at the highest thresholds, precision peaks at about 86 %, so false positives remain and a manual review step is still required for final approval.
Dependence on embedding quality – The performance hinges on the underlying biomedical language model; newer models (e.g., PubMed‑LLM) could further improve semantic matching.
Static MedDRA version – The study used a single MedDRA release; future work should evaluate robustness across version updates.
Explainability – While cosine similarity is intuitive, providing clinicians with clear rationale (e.g., highlighted synonyms) would increase trust.
Extension to hierarchical queries – Incorporating MedDRA’s hierarchical levels (SOC, HLGT, HLT) could enable more nuanced query generation beyond flat PT lists.

By addressing these points, SafeTerm could evolve from a helpful assistant to a fully autonomous component of the drug‑safety ecosystem.

Authors

Francois Vandenhende
Anna Georgiou
Michalis Georgiou
Theodoros Psaras
Ellie Karekla
Elena Hadjicosta

Paper Information

arXiv ID: 2512.07694v1
Categories: cs.CL
Published: December 8, 2025
