[Paper] Who Guards the Guardians? The Challenges of Evaluating Identifiability of Learned Representations

Published: February 27, 2026 at 01:50 PM EST
5 min read
Source: arXiv - 2602.24278v1

Overview

The paper “Who Guards the Guardians? The Challenges of Evaluating Identifiability of Learned Representations” uncovers a hidden flaw in the way the machine‑learning community measures whether a representation learning model has truly recovered the underlying factors of variation. The authors demonstrate that popular evaluation metrics (MCC, DCI, R², etc.) only give reliable answers under very specific assumptions about the data‑generating process and the geometry of the encoder—assumptions that are often violated in real‑world settings. When those assumptions break, the metrics can misleadingly claim success (false positives) or miss genuine recovery (false negatives).

Key Contributions

  • Critical analysis of existing identifiability metrics – Shows how each metric implicitly encodes hidden assumptions about the data and the encoder.
  • Taxonomy of assumptions – Separates Data‑Generating Process (DGP) assumptions from Encoder Geometry assumptions, providing a clear map of where each metric is valid.
  • Stress‑testing framework – Releases an open‑source evaluation suite that systematically perturbs synthetic benchmarks to expose metric misspecifications.
  • Empirical evidence of systematic failures – Demonstrates false‑positive and false‑negative cases both in classic identifiability regimes and in post‑hoc (unsupervised) settings where reliable evaluation is most needed.
  • Guidelines for practitioners – Offers practical recommendations on selecting or designing metrics that align with the specific assumptions of a given problem.

Methodology

  1. Formalizing Metric Assumptions

    • The authors decompose each metric into two components: (a) DGP assumptions (e.g., linearity, independence of factors, noise distribution) and (b) Encoder geometry assumptions (e.g., invertibility, orthogonality).
    • They prove mathematically that a metric’s guarantee of “identifiability up to equivalence” holds only when both sets of assumptions are satisfied.
  2. Taxonomy Construction

    • By cataloguing common synthetic benchmarks (e.g., dSprites, 3D Shapes) and popular metrics, they build a matrix that shows which combinations of DGP and encoder properties each metric can tolerate.
  3. Stress‑Testing Suite

    • The suite generates synthetic datasets where a single assumption is deliberately broken (e.g., adding correlated noise, using a non‑linear mixing function, or training an encoder with a non‑invertible architecture).
    • For each perturbed dataset, they compute the standard metrics and compare them against a ground‑truth “oracle” that knows the true latent factors.
  4. Empirical Evaluation

    • They run a battery of experiments across multiple representation‑learning models (VAE, β‑VAE, InfoGAN, contrastive methods) and record where metrics diverge from the oracle.
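The metric-versus-oracle comparison at the heart of the stress test can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration (not the authors' released suite): it computes MCC as the mean absolute Pearson correlation under the best one-to-one matching of true and learned dimensions, then evaluates it on a linear mixing (the regime MCC implicitly assumes) and on a deliberately non-linear one.

```python
from itertools import permutations

import numpy as np

def mcc(z_true, z_hat):
    """Mean correlation coefficient: mean |Pearson r| between true and
    learned dimensions under the best one-to-one matching (brute-forced
    here, which is fine for small latent dimensionality)."""
    d = z_true.shape[1]
    corr = np.abs(np.corrcoef(z_true.T, z_hat.T)[:d, d:])  # (true x learned)
    return float(max(np.mean([corr[i, p[i]] for i in range(d)])
                     for p in permutations(range(d))))

rng = np.random.default_rng(0)
z = rng.normal(size=(10_000, 3))   # independent ground-truth factors
mix = rng.normal(size=(3, 3))
x_linear = z @ mix                 # DGP that MCC implicitly assumes
x_nonlin = np.tanh(x_linear)       # perturbation: non-linear mixing

print(f"MCC, linear mixing:     {mcc(z, x_linear):.3f}")
print(f"MCC, non-linear mixing: {mcc(z, x_nonlin):.3f}")
```

Because the true factors `z` play the role of the oracle, any gap between the metric's score and perfect recovery (MCC of a representation against itself is 1.0) can be attributed to the broken assumption rather than to the model.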

Results & Findings

The paper's per-metric findings (required assumptions and observed failure modes) are:

  • MCC (Mean Correlation Coefficient) – Requires linear mixing, independent factors, and a full‑rank encoder; yields false positives when factors are non‑linearly mixed but the encoder is still linear.
  • DCI (Disentanglement‑Completeness‑Informativeness) – Requires factor independence and an axis‑aligned latent space; yields false negatives when the encoder rotates the latent space (still identifiable up to rotation).
  • R² (Explained variance) – Requires Gaussian noise and a linear decoder; systematically over‑estimates recovery when noise is heavy‑tailed.
  • HSIC‑based metrics – Impose no specific DGP assumptions but require kernel smoothness; break down with discrete latent factors.
  • False Positives: In several post‑hoc scenarios (e.g., contrastive encoders trained on corrupted data), MCC reported near‑perfect recovery even though the learned representation was provably non‑identifiable.
  • False Negatives: DCI often penalized models that performed a simple orthogonal rotation of the true factors—an operation that is allowed under identifiability theory but not captured by DCI’s axis‑alignment bias.
  • Robustness Gap: No single existing metric remained reliable across all tested perturbations; each had a narrow validity domain.
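The DCI false-negative mechanism is easy to reproduce with a toy stand-in. The sketch below is an illustrative simplification, not the official DCI implementation: importances come from absolute least-squares coefficients, and the "disentanglement" score is one minus the normalised entropy of each learned dimension's importance profile. A pure rotation preserves all information about the factors, yet the axis-alignment bias drives the score down.

```python
import numpy as np

def disentanglement(z_true, z_hat):
    """Toy DCI-style disentanglement score (illustrative only): importance
    of each learned dimension for each true factor from |least-squares
    coefficients|; 1.0 means each dimension captures exactly one factor."""
    W, *_ = np.linalg.lstsq(z_hat, z_true, rcond=None)   # learned -> true
    R = np.abs(W)                                        # importance matrix
    P = R / R.sum(axis=1, keepdims=True)                 # per-dimension profile
    H = -(P * np.log(P + 1e-12)).sum(axis=1) / np.log(P.shape[1])
    return float((1.0 - H).mean())

rng = np.random.default_rng(1)
z = rng.normal(size=(5_000, 3))
theta = np.pi / 4
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],   # rotate first two axes
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0,            0.0,           1.0]])

print(disentanglement(z, z))        # axis-aligned recovery: near-perfect score
print(disentanglement(z, z @ rot))  # same information, rotated: penalised
```

Identifiability theory treats the rotated representation as equivalent to the original, so the score drop is a property of the metric's hidden axis-alignment assumption, not of the representation.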

Practical Implications

  • Metric Selection Becomes a Design Decision: Developers can no longer treat MCC, DCI, or R² as plug‑and‑play diagnostics. Instead, they must first verify that their data and model satisfy the metric’s hidden assumptions.
  • Better Benchmark Design: When constructing synthetic benchmarks for representation learning, practitioners should deliberately vary DGP properties (e.g., introduce factor correlations, non‑linear mixing) to ensure that claimed improvements are not artifacts of metric misspecification.
  • Model Debugging: The taxonomy helps pinpoint why a metric is misbehaving—e.g., a low DCI score may simply indicate a rotated latent space rather than a failure to learn the factors.
  • Tooling Integration: The released stress‑testing suite can be incorporated into CI pipelines for representation‑learning libraries (e.g., torchdisentangle, scikit‑learn), automatically flagging when a chosen metric is out of its validity domain.
  • Guidance for Post‑hoc Identifiability: In downstream tasks (fairness, causal inference) where we rely on post‑hoc identifiability checks, the paper warns that current metrics may give a false sense of security, urging the community to develop more assumption‑agnostic evaluation methods.
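A validity-domain gate of the kind suggested above could look like the following. Everything here is hypothetical (function name, report fields, and thresholds are illustrative, not from the released suite): before trusting MCC, it checks that the true factors are approximately uncorrelated and that a linear map from factors to observations explains most of the variance.

```python
import numpy as np

def check_mcc_validity(z, x, corr_tol=0.1, lin_r2_min=0.95):
    """Hypothetical pre-flight check before trusting MCC on (z, x).
    Thresholds and field names are illustrative assumptions."""
    report = {}
    # 1. Factor independence: off-diagonal correlations among true factors
    c = np.corrcoef(z.T)
    off_diag = c - np.diag(np.diag(c))
    report["factors_uncorrelated"] = bool(np.all(np.abs(off_diag) < corr_tol))
    # 2. Linear mixing: variance of x explained by the best linear map from z
    W, *_ = np.linalg.lstsq(z, x, rcond=None)
    resid = x - z @ W
    report["linear_fit_r2"] = float(1.0 - resid.var() / x.var())
    report["mixing_approx_linear"] = report["linear_fit_r2"] >= lin_r2_min
    return report

rng = np.random.default_rng(2)
z = rng.normal(size=(5_000, 3))
mix = rng.normal(size=(3, 3))
print(check_mcc_validity(z, z @ mix))                # MCC's home turf
print(check_mcc_validity(z, np.tanh(2 * z @ mix)))   # broken: non-linear mixing
```

A CI pipeline could fail (or downgrade the metric to "advisory") whenever any report entry is False, which is exactly the kind of automatic flagging the paper's suite is meant to enable.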

Limitations & Future Work

  • Synthetic Focus: All experiments are on controlled synthetic data; real‑world datasets (e.g., medical imaging, sensor streams) may exhibit more complex violations that were not explored.
  • Metric Scope: The study concentrates on a handful of widely used metrics; newer or domain‑specific measures (e.g., mutual information estimators) remain unexamined.
  • Encoder Diversity: While several encoder families were tested, the analysis does not cover recent transformer‑based or graph‑neural encoders that may have distinct geometric properties.
  • Future Directions: The authors propose extending the taxonomy to probabilistic identifiability criteria, designing robust metrics that adapt to unknown DGP properties, and validating the framework on large‑scale real datasets.

Bottom line: The paper shines a light on a hidden source of bias in how we evaluate representation learning models. By making the assumptions of each metric explicit and providing tools to stress‑test them, it gives developers a practical roadmap to avoid misleading conclusions and to build more trustworthy, truly identifiable systems.

Authors

  • Shruti Joshi
  • Théo Saulus
  • Wieland Brendel
  • Philippe Brouillard
  • Dhanya Sridhar
  • Patrik Reizinger

Paper Information

  • arXiv ID: 2602.24278v1
  • Categories: cs.LG
  • Published: February 27, 2026
