[Paper] AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation
Source: arXiv - 2601.03191v1
Overview
AnatomiX is a multimodal large language model (LLM) that couples visual understanding of chest X‑rays with anatomical awareness. By explicitly grounding its reasoning in the anatomy of the thorax, the model produces interpretations that are both more reliable and explicitly localized, which is crucial for clinical decision support and downstream AI tools that need to “know where” a finding is located.
Key Contributions
- Anatomy‑aware two‑stage pipeline – first detects and extracts features from specific thoracic structures, then feeds these representations to a language model for downstream tasks.
- Unified multitask framework – supports phrase grounding, report generation, visual question answering (VQA), and image understanding with a single model.
- State‑of‑the‑art grounding performance – achieves >25 % relative gains on anatomy grounding, phrase grounding, grounded diagnosis, and grounded captioning benchmarks versus prior multimodal medical LLMs.
- Open‑source release – code and pretrained weights are publicly available, enabling reproducibility and rapid adoption by the community.
Methodology
Anatomical Structure Identification
- A dedicated vision encoder (e.g., a CNN or ViT) processes the chest X‑ray and produces region proposals for key anatomical parts (lungs, heart, ribs, mediastinum, etc.).
- A lightweight classifier refines these proposals, yielding a set of anatomy tokens, each paired with a visual embedding.
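A minimal sketch of how this stage could look, assuming a generic torchvision Faster R‑CNN stands in for the paper's anatomy detector (in practice it would be fine‑tuned on anatomy annotations); the class set, embedding size, and ROI‑pooling choice below are illustrative assumptions, not details from the paper:

```python
# Sketch of anatomy-token extraction (assumed design, not the paper's exact detector).
import torch
import torchvision

# A COCO-pretrained detector as a stand-in; in practice it would be fine-tuned to
# predict thoracic structures (lungs, heart, ribs, mediastinum, ...).
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def extract_anatomy_tokens(image: torch.Tensor, top_k: int = 6):
    """Return (boxes, scores, embeddings) for the top-k proposed regions.

    `image` is a (3, H, W) float tensor in [0, 1]; each row of `embeddings`
    plays the role of one anatomy token's visual embedding.
    """
    with torch.no_grad():
        detections = detector([image])[0]                     # boxes, labels, scores
        feature_maps = detector.backbone(image.unsqueeze(0))  # FPN pyramid, keys "0".."pool"

    boxes = detections["boxes"][:top_k]
    scores = detections["scores"][:top_k]
    # Pool a fixed-size feature for each box from the stride-4 feature map.
    pooled = torchvision.ops.roi_align(
        feature_maps["0"], [boxes], output_size=(7, 7), spatial_scale=0.25
    )                                                         # (top_k, 256, 7, 7)
    embeddings = pooled.flatten(2).mean(-1)                   # (top_k, 256) anatomy tokens
    return boxes, scores, embeddings
```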
Feature Extraction & Fusion
- The visual embeddings are projected into the same latent space as the language model’s token embeddings.
- A cross‑modal attention layer lets the LLM attend selectively to the anatomy tokens when generating text or answering questions.
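The fusion step might be sketched as follows; the projection dimension, number of heads, and the single cross‑attention layer are assumptions chosen for illustration rather than the paper's reported architecture:

```python
# Sketch of visual-to-text projection plus cross-modal attention (assumed sizes).
import torch
import torch.nn as nn

class AnatomyFusion(nn.Module):
    def __init__(self, vision_dim: int = 256, text_dim: int = 1024, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)              # anatomy tokens -> LLM latent space
        self.cross_attn = nn.MultiheadAttention(text_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_states: torch.Tensor, anatomy_embeds: torch.Tensor) -> torch.Tensor:
        # text_states: (B, T, text_dim) hidden states of the language model
        # anatomy_embeds: (B, A, vision_dim), one embedding per detected structure
        anatomy_tokens = self.proj(anatomy_embeds)               # (B, A, text_dim)
        attended, _ = self.cross_attn(query=text_states, key=anatomy_tokens, value=anatomy_tokens)
        return self.norm(text_states + attended)                 # residual fusion

# Example: 4 reports of 128 text tokens, each attending over 6 anatomy tokens.
fused = AnatomyFusion()(torch.randn(4, 128, 1024), torch.randn(4, 6, 256))  # (4, 128, 1024)
```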
Task Heads
- Phrase Grounding: aligns medical phrases (e.g., “right lower lobe opacity”) with the corresponding anatomy token.
- Report Generation: conditions the language model on the ordered anatomy tokens to produce structured radiology reports.
- VQA / Image Understanding: interprets natural‑language queries by attending to the relevant anatomical region before producing an answer.
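To make the phrase‑grounding head concrete, one plausible formulation scores each anatomy token against a pooled phrase embedding and grounds the phrase in the highest‑scoring region; the projection layers and cosine scoring below are an assumption, not the paper's exact head:

```python
# Sketch of a phrase-grounding head (assumed similarity-based formulation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseGroundingHead(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.phrase_proj = nn.Linear(dim, dim)
        self.anatomy_proj = nn.Linear(dim, dim)

    def forward(self, phrase_states: torch.Tensor, anatomy_tokens: torch.Tensor) -> torch.Tensor:
        # phrase_states: (B, P, dim) token states for a phrase such as "right lower lobe opacity"
        # anatomy_tokens: (B, A, dim) fused anatomy tokens
        phrase = F.normalize(self.phrase_proj(phrase_states.mean(dim=1)), dim=-1)  # (B, dim)
        anatomy = F.normalize(self.anatomy_proj(anatomy_tokens), dim=-1)           # (B, A, dim)
        return torch.einsum("bd,bad->ba", phrase, anatomy)  # (B, A); argmax = grounded region
```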
The entire system is trained end‑to‑end on a mixture of publicly available chest X‑ray datasets (e.g., MIMIC‑CXR, CheXpert) with supervision for both visual grounding and language generation.
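End‑to‑end training over grounding and generation could combine the two objectives as a weighted sum; the loss choices and weights below are illustrative assumptions rather than the paper's reported recipe:

```python
# Sketch of a joint grounding + report-generation objective (assumed weighting).
import torch
import torch.nn.functional as F

def multitask_loss(grounding_logits: torch.Tensor, grounding_targets: torch.Tensor,
                   lm_logits: torch.Tensor, lm_targets: torch.Tensor,
                   w_ground: float = 1.0, w_lm: float = 1.0) -> torch.Tensor:
    # grounding_logits: (B, A) scores over anatomy tokens; grounding_targets: (B,) region indices
    # lm_logits: (B, T, V) next-token logits; lm_targets: (B, T) report token ids (-100 = ignored)
    loss_ground = F.cross_entropy(grounding_logits, grounding_targets)
    loss_lm = F.cross_entropy(lm_logits.flatten(0, 1), lm_targets.flatten(), ignore_index=-100)
    return w_ground * loss_ground + w_lm * loss_lm
```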
Results & Findings
- Anatomy Grounding: 78 % accuracy (vs. 62 % for the strongest baseline).
- Phrase Grounding: 71 % IoU‑based score, a 27 % relative improvement.
- Grounded Diagnosis: 84 % F1 on disease classification when the model is required to cite the responsible anatomy, surpassing the baseline by 25 %.
- Grounded Captioning: BLEU‑4 score of 0.38, beating prior methods by >0.1 points while also providing explicit region tags.
These numbers indicate that AnatomiX not only predicts the right findings but also correctly localizes them—a critical step toward trustworthy AI in radiology.
Practical Implications
- Clinical Decision Support: Radiologists can receive AI‑generated reports that explicitly reference anatomical locations, reducing ambiguity and easing verification.
- Regulatory Compliance: Grounded explanations help address emerging “explainable AI” expectations for medical software, which can support FDA or CE clearance submissions.
- Developer Tooling: The open‑source model can be integrated into PACS viewers, tele‑radiology platforms, or research pipelines to add anatomy‑aware VQA or automated report drafting with minimal engineering effort; a hypothetical integration sketch follows this list.
- Data Annotation: The anatomy detection stage can be repurposed as a semi‑automatic annotator, accelerating the creation of labeled datasets for other thoracic imaging tasks.
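As a purely hypothetical illustration of the developer‑tooling point above, an integration into a reading workflow might look like the sketch below; the `anatomix` package name, loader, and `answer` signature are invented placeholders, so the released code's actual interface should be consulted:

```python
# Hypothetical integration sketch; every API name here is a placeholder, not the real interface.
from pathlib import Path

def draft_grounded_answer(image_path: Path, question: str) -> dict:
    import anatomix  # placeholder import standing in for the released package
    model = anatomix.load_pretrained("anatomix-base")        # hypothetical loader
    result = model.answer(image_path=str(image_path), question=question)
    # Hypothetical result: free-text answer plus the anatomy regions it is grounded in.
    return {"answer": result.text, "regions": result.grounded_regions}
```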
Limitations & Future Work
- Dataset Bias: Training relies heavily on publicly available chest X‑ray corpora, which may under‑represent rare pathologies or pediatric cases.
- Resolution Constraints: The visual encoder operates on down‑sampled images (≈224×224), potentially missing fine‑grained details such as subtle interstitial patterns.
- Generalization to Other Modalities: While the pipeline is designed for chest X‑rays, extending it to CT, MRI, or ultrasound will require new anatomy token definitions and possibly larger visual backbones.
- Future Directions: The authors plan to (1) incorporate higher‑resolution feature maps, (2) explore self‑supervised anatomy discovery to reduce reliance on annotated masks, and (3) evaluate the model in prospective clinical workflows to measure real‑world impact.
Authors
- Anees Ur Rehman Hashmi
- Numan Saeed
- Christoph Lippert
Paper Information
- arXiv ID: 2601.03191v1
- Categories: cs.CV, cs.AI, cs.LG
- Published: January 6, 2026