[Paper] A Calibrated Memorization Index (MI) for Detecting Training Data Leakage in Generative MRI Models

Published: February 13, 2026

Source: arXiv - 2602.13066v1

Overview

Generative AI models for MRI scans can unintentionally reproduce exact copies of the images they were trained on, raising serious privacy concerns for patients and healthcare providers. This paper introduces a Calibrated Memorization Index (MI)—a per‑sample metric that reliably flags when a generated image is a near‑duplicate of training data, helping developers audit and mitigate data leakage in medical image synthesis pipelines.

Key Contributions

  • Per‑sample leakage detector: A calibrated metric (ONI/MI) that quantifies how “overfit” or “novel” each generated MRI is.
  • Foundation‑model features: Leverages embeddings from a state‑of‑the‑art MRI foundation model, ensuring the metric works on clinically relevant visual cues rather than raw pixels.
  • Multi‑layer whitening & nearest‑neighbor aggregation: Combines similarity scores across several network layers to capture both low‑level texture and high‑level anatomical similarity.
  • Robust cross‑dataset performance: Validated on three diverse MRI datasets with varying duplication rates and common augmentations (cropping, intensity scaling, etc.).
  • Near‑perfect duplicate detection: Achieves >99 % true‑positive rate for identifying exact or near‑exact copies at the individual‑sample level.

Methodology

  1. Feature extraction – Each MRI (both real training images and generated outputs) is passed through a pre‑trained MRI foundation model (e.g., a Vision Transformer tuned on large‑scale brain scans). The model provides a hierarchy of feature maps.
  2. Whitening – For each layer, features are whitened (zero‑mean, unit‑variance) to neutralize scale differences and make distances comparable across layers.
  3. Nearest‑neighbor similarity – The Euclidean distance between a generated sample’s whitened features and its closest training‑sample counterpart is computed per layer.
  4. Aggregation – Layer‑wise similarities are combined (weighted average) into a single similarity score.
  5. Calibration to ONI/MI – The raw similarity is mapped onto a bounded scale:
    • Overfit/Novelty Index (ONI): 0 = completely novel, 1 = identical to a training image.
    • Memorization Index (MI): A calibrated version of ONI that accounts for dataset‑level baseline similarity, yielding a more interpretable “how much memorization” value.

The pipeline runs in linear time with respect to the number of generated samples and can be integrated into existing training loops or post‑hoc audit tools.
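The steps above can be sketched in plain NumPy. This is a minimal illustration, not the authors' implementation: the `exp(-distance)` mapping to a bounded ONI and the median-based dataset baseline for the MI are assumptions standing in for the paper's calibration procedure, and the function names are invented for this example.

```python
import numpy as np

def whiten(feats, eps=1e-8):
    """Zero-mean, unit-variance per feature dimension (step 2)."""
    mu = feats.mean(axis=0, keepdims=True)
    sd = feats.std(axis=0, keepdims=True)
    return (feats - mu) / (sd + eps)

def layer_nn_distance(gen_layer, train_layer):
    """Per-sample Euclidean distance to the closest training sample (step 3)."""
    # (n_gen, n_train) pairwise distance matrix, then min over training samples
    d = np.linalg.norm(gen_layer[:, None, :] - train_layer[None, :, :], axis=-1)
    return d.min(axis=1)

def memorization_index(gen_feats, train_feats, weights=None):
    """
    gen_feats / train_feats: lists of (n_samples, dim_l) arrays, one per layer
    (the foundation model's feature hierarchy from step 1).
    Returns per-sample (ONI, MI); higher = closer to a training image.
    """
    n_layers = len(gen_feats)
    weights = np.ones(n_layers) / n_layers if weights is None else np.asarray(weights)
    per_layer = []
    for g, t in zip(gen_feats, train_feats):
        both = whiten(np.vstack([g, t]))        # whiten jointly so distances are comparable
        gw, tw = both[: len(g)], both[len(g):]
        per_layer.append(layer_nn_distance(gw, tw))
    dist = np.average(np.stack(per_layer), axis=0, weights=weights)  # step 4
    oni = np.exp(-dist)                         # assumed bounded mapping: 0 distance -> ONI = 1
    baseline = np.median(oni)                   # assumed dataset-level baseline similarity
    mi = np.clip((oni - baseline) / (1 - baseline + 1e-8), 0.0, 1.0)
    return oni, mi
```

An exact copy of a training image yields zero whitened distance at every layer, so its ONI is 1 and its MI sits at the top of the calibrated scale, while samples near the dataset's baseline similarity get an MI near 0.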

Results & Findings

| Dataset (duplication %) | ONI/MI detection accuracy (sample‑level) | False‑positive rate (novel samples) |
| --- | --- | --- |
| Brain‑MRI (0 % dup) | 99.8 % (near‑perfect) | <1 % |
| Knee‑MRI (10 % dup) | 99.5 % | 1.2 % |
| Whole‑body MRI (30 % dup) | 99.9 % | 0.8 % |
  • Consistency across augmentations: Even when training images were randomly rotated, intensity‑scaled, or cropped, the MI remained stable, showing that the metric captures true semantic duplication rather than superficial pixel similarity.
  • Dataset‑agnostic calibration: By normalizing against each dataset’s similarity distribution, the MI values are comparable across different anatomical sites and scanner protocols.

Practical Implications

  • Privacy compliance: Hospitals and AI vendors can embed the MI check into their model release pipelines to certify that generated MRIs do not leak patient data, satisfying GDPR, HIPAA, and other regulations.
  • Model debugging: Developers can pinpoint which training samples are being memorized, enabling targeted data augmentation or regularization (e.g., stronger dropout, differential privacy).
  • Synthetic data marketplaces: Platforms that sell or share synthetic medical images can provide an MI‑based “privacy score” for each batch, building trust with buyers.
  • Regulatory audit trails: The per‑sample ONI/MI logs can serve as evidence in audits, showing exactly which outputs were flagged and why.

Limitations & Future Work

  • Dependence on a high‑quality foundation model: The metric’s reliability hinges on the representational power of the underlying MRI encoder; poorly trained encoders could miss subtle memorization.
  • Scalability to massive datasets: While linear in the number of generated samples, nearest‑neighbor search across millions of training images may require approximate methods (e.g., FAISS) to stay performant.
  • Extension beyond MRI: The authors note that adapting the pipeline to other modalities (CT, histopathology) will need modality‑specific foundation models and calibration studies.
  • Adversarial duplication: Future work could explore whether malicious actors can deliberately craft near‑duplicates that evade the MI, prompting research into more robust similarity measures.

Authors

  • Yash Deo
  • Yan Jia
  • Toni Lassila
  • Victoria J Hodge
  • Alejandro F. Frangi
  • Chenghao Qian
  • Siyuan Kang
  • Ibrahim Habli

Paper Information

  • arXiv ID: 2602.13066v1
  • Categories: cs.CV
  • Published: February 13, 2026