[Paper] UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation

Published: January 16, 2026 at 01:59 PM EST
4 min read
Source: arXiv - 2601.11522v1

Overview

The paper introduces UniX, a unified foundation model that can both understand and generate chest X‑ray images. By separating the semantic‑focused autoregressive (AR) branch from the pixel‑level diffusion branch—and then letting them talk to each other through cross‑modal self‑attention—UniX achieves state‑of‑the‑art results while using far fewer parameters than existing large medical models.

Key Contributions

  • Dual‑branch architecture: An AR encoder‑decoder for diagnostic understanding and a diffusion decoder for high‑fidelity image synthesis, each optimized for its own objective.
  • Cross‑modal self‑attention: A lightweight attention module that injects semantic cues from the AR branch into the diffusion process, ensuring generated images respect clinical context.
  • Robust data pipeline: Automated cleaning and de‑duplication of large chest‑X‑ray corpora to reduce label noise and improve downstream performance (a generic de‑duplication sketch follows this list).
  • Multi‑stage training strategy: First pre‑train the AR branch, then the diffusion branch, and finally fine‑tune them jointly, enabling knowledge transfer without catastrophic forgetting.
  • Parameter efficiency: Reaches or exceeds task‑specific baselines while using only ~25 % of the parameters of the prior LLM‑CXR model.
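
The paper lists duplicate detection among its cleaning heuristics. As a generic illustration (not the authors' pipeline), a perceptual average‑hash filter of the kind sketched below is a common way to drop near‑duplicate radiographs; the hash size and Hamming threshold are placeholder choices.

```python
import numpy as np
from PIL import Image

def ahash(path: str, size: int = 8) -> int:
    """Average hash: downscale, threshold at the mean, pack bits into an int."""
    img = Image.open(path).convert("L").resize((size, size))
    px = np.asarray(img, dtype=np.float32)
    bits = (px > px.mean()).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def dedupe(paths: list[str], max_hamming: int = 4) -> list[str]:
    """Greedily keep one representative per near-duplicate cluster."""
    kept, hashes = [], []
    for p in paths:
        h = ahash(p)
        # keep the image only if it is far (in Hamming distance) from all kept hashes
        if all(bin(h ^ k).count("1") > max_hamming for k in hashes):
            kept.append(p)
            hashes.append(h)
    return kept
```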

Methodology

  1. Data preparation – The authors scrape several public chest‑X‑ray datasets, run a series of heuristics (e.g., duplicate detection, report‑image alignment checks) and curate a clean, balanced corpus.
  2. Autoregressive (AR) branch – A transformer‑style encoder processes the radiology report, while a decoder predicts a sequence of visual tokens (e.g., VQ‑GAN codes). This branch is trained with a standard cross‑entropy loss, encouraging it to capture diagnostic semantics.
  3. Diffusion branch – A latent diffusion model (LDM) learns to reconstruct high‑resolution X‑ray images from noisy latent vectors. The diffusion loss is the usual denoising score matching objective.
  4. Cross‑modal self‑attention – At each diffusion timestep, the latent representation attends to the AR hidden states. This dynamic conditioning lets the generator “listen” to the understanding branch, aligning pixel details with clinical concepts (see the sketch after this list).
  5. Training schedule
    • Stage 1: Pre‑train AR on report‑image pairs.
    • Stage 2: Freeze AR, train diffusion on clean images.
    • Stage 3: Joint fine‑tuning with cross‑modal attention, using a weighted sum of AR and diffusion losses.
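
To make the conditioning concrete, here is a minimal PyTorch sketch. It is an illustration rather than the authors' code: it assumes the "cross‑modal self‑attention" is joint self‑attention over the concatenation of AR hidden states and noisy latents, and the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class CrossModalSelfAttention(nn.Module):
    """Joint self-attention over AR hidden states and diffusion latents.

    Illustrative sketch: semantic cues flow from the AR branch into the
    diffusion process; exact shapes and layer choices are assumptions.
    """

    def __init__(self, dim: int = 768, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, latents: torch.Tensor, ar_hidden: torch.Tensor) -> torch.Tensor:
        # latents:   (B, N_img, dim)  noisy latent tokens at the current timestep
        # ar_hidden: (B, N_txt, dim)  hidden states from the understanding branch
        joint = self.norm(torch.cat([ar_hidden, latents], dim=1))
        out, _ = self.attn(joint, joint, joint)        # full self-attention
        return latents + out[:, ar_hidden.size(1):]   # residual update, image positions only
```

At sampling time a module like this would run once per denoising step, so the AR hidden states act as a fixed semantic prompt while the latents are progressively refined.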

The whole pipeline is implemented in PyTorch and can be run on a single 8‑GPU node (A100), thanks to the modular design.
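
Stage 3 optimizes both branches at once. The paper states only that a weighted sum of the AR and diffusion losses is used; the sketch below shows one straightforward form of that objective, with lambda_ar and lambda_diff as placeholder hyperparameters.

```python
import torch.nn.functional as F

def stage3_loss(ar_logits, target_tokens, eps_pred, eps_true,
                lambda_ar: float = 1.0, lambda_diff: float = 1.0):
    """Weighted sum of the AR cross-entropy and the diffusion denoising loss.

    ar_logits:     (B, T, V) next-token logits from the AR decoder
    target_tokens: (B, T)    ground-truth visual/text token ids
    eps_pred/true: predicted vs. true noise (epsilon-prediction form of
                   the denoising score-matching objective)
    """
    ce = F.cross_entropy(ar_logits.flatten(0, 1), target_tokens.flatten())
    dsm = F.mse_loss(eps_pred, eps_true)
    return lambda_ar * ce + lambda_diff * dsm
```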

Results & Findings

| Task | Metric | UniX | Prior best (task‑specific) | LLM‑CXR | % Δ vs. LLM‑CXR |
|------|--------|------|----------------------------|---------|-----------------|
| Understanding | Micro‑F1 ↑ | 0.842 | 0.842 | 0.577 (AR‑only) | +46.1 % |
| Generation | FD‑RadDino ↓ | 0.112 | 0.112 | 0.148 (Diffusion‑only) | +24.2 % |
| Parameter count | – | 120 M | – | 480 M | – |

  • Understanding: UniX matches or surpasses dedicated classification/report‑generation models, showing that the AR branch does not suffer from the presence of the diffusion branch.
  • Generation: The cross‑modal attention yields sharper, clinically plausible X‑rays, reflected in a lower Fréchet Distance (FD‑RadDino).
  • Efficiency: With a quarter of the parameters, training time drops by ~30 % and inference latency stays under 200 ms per image on a single GPU.

Practical Implications

  • Rapid prototyping – Developers can spin up a single API that both classifies a chest X‑ray (e.g., “pneumonia present”) and synthesizes a realistic counterfactual image for data augmentation or teaching (see the usage sketch after this list).
  • Data augmentation – High‑quality synthetic X‑rays conditioned on specific findings can bolster scarce labeled datasets, improving downstream models without costly manual annotation.
  • Clinical decision support – The unified model can generate “what‑if” visualizations (e.g., simulate disease progression) directly from a radiology report, aiding education and patient communication.
  • Resource‑constrained deployments – Because UniX is lightweight, it fits on edge‑servers in hospitals or cloud‑functions, making it feasible for real‑time integration into PACS or EMR workflows.
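
To give a flavor of what that single API could look like, here is a purely hypothetical usage sketch; `UniX.from_pretrained`, `read`, and `generate` are invented names, not an interface published with the paper.

```python
# Hypothetical interface; class, method, and checkpoint names are illustrative only.
model = UniX.from_pretrained("unix-cxr-base")

# Understanding branch: classify / describe an existing radiograph
findings = model.read("patient_0123_pa.png")
print(findings)  # e.g., {"pneumonia": 0.91, "pleural_effusion": 0.12}

# Generation branch: synthesize a counterfactual image from an edited report
image = model.generate("No acute cardiopulmonary abnormality.")
image.save("counterfactual_normal.png")
```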

Limitations & Future Work

  • Domain specificity – UniX is trained exclusively on chest X‑rays; extending the architecture to other modalities (CT, MRI) will require modality‑specific tokenizers and diffusion priors.
  • Interpretability of cross‑modal attention – While the attention maps appear to align with clinical terms, a systematic evaluation of their reliability is still missing.
  • Regulatory considerations – Synthetic medical images raise concerns about inadvertent bias or misuse; the authors note the need for robust validation pipelines before clinical deployment.
  • Future directions – The authors suggest (1) multi‑modal conditioning (e.g., adding patient metadata), (2) self‑supervised pre‑training on unlabelled radiographs, and (3) tighter integration with large language models for full report generation.

Authors

  • Ruiheng Zhang
  • Jingfeng Yao
  • Huangxuan Zhao
  • Hao Yan
  • Xiao He
  • Lei Chen
  • Zhou Wei
  • Yong Luo
  • Zengmao Wang
  • Lefei Zhang
  • Dacheng Tao
  • Bo Du

Paper Information

  • arXiv ID: 2601.11522v1
  • Categories: cs.CV
  • Published: January 16, 2026