[Paper] Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models

Published: February 27, 2026 at 01:32 PM EST

Source: arXiv - 2602.24264v1

Overview

A new study tackles a fundamental question for AI systems that work with images: what must a model’s internal representations look like to reliably recognize familiar parts in never‑seen combinations? By formalizing the geometric properties that enable compositional generalization, the authors show that embeddings need to be linear and orthogonal across concepts. Their theory bridges a gap between abstract cognitive desiderata and the concrete behavior of today’s large‑scale vision models such as CLIP and DINO.

Key Contributions

  • Three formal desiderata (divisibility, transferability, stability) that any compositional system should satisfy under standard supervised training.
  • Proof that these desiderata force embeddings to decompose linearly into per‑concept vectors that are mutually orthogonal.
  • Derivation of dimension bounds linking the number of composable concepts to the minimal embedding size required.
  • Empirical validation on state‑of‑the‑art vision encoders (CLIP, SigLIP, DINO), demonstrating partial linear factorization and near‑orthogonal concept subspaces.
  • Correlation analysis showing that the degree of linear‑orthogonal structure predicts performance on held‑out compositional tasks.
  • Open‑source code for reproducing the experiments and probing other models.
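In notation, the decomposition the proof yields can be written as follows (the symbol names here are ours, chosen for illustration, not necessarily the paper's):

```latex
z(x) = \sum_{c \in C(x)} v_c, \qquad \langle v_c, v_{c'} \rangle = 0 \quad \text{for } c \neq c'
```

where \(z(x)\) is the embedding of image \(x\), \(C(x)\) is the set of concepts present in \(x\), and each \(v_c\) is a fixed per-concept vector. Divisibility gives the sum, transferability makes each \(v_c\) context-independent, and stability plus orthogonality keep the parts recoverable from the whole.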

Methodology

  1. Formalizing compositionality – The authors define three intuitive properties:
    • Divisibility: a representation of a composite image should be expressible as a sum of its parts.
    • Transferability: the same part representation should work across different contexts.
    • Stability: small perturbations (e.g., lighting) should not break the decomposition.
  2. Geometric analysis – Using linear algebra, they prove that any embedding satisfying the three properties must be a linear combination of orthogonal concept vectors. In other words, each visual concept occupies its own axis in the high‑dimensional space.
  3. Dimension bound derivation – By counting the number of independent concepts, they obtain a lower bound on the embedding dimensionality needed for perfect compositionality.
  4. Empirical probing – For each pretrained vision model they:
    • Construct a dataset of images that systematically combine a set of visual primitives (shapes, colors, textures).
    • Fit a linear factor model (e.g., PCA + regression) to extract per‑concept subspaces.
    • Measure orthogonality (cosine similarity between subspaces) and rank (how many dimensions each concept actually uses).
    • Evaluate compositional generalization on held‑out combinations and compute the correlation with the measured geometric metrics.

All steps are implemented in PyTorch and the analysis scripts are publicly released.
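The probing step (items 2–3 above) can be sketched in a few lines. This is a minimal illustration on synthetic data, not the paper's released code: the variable names, the least-squares factor model, and the noise level are all our assumptions.

```python
# Hypothetical sketch of the linear-factor probing step: given embeddings
# and binary concept labels, solve a least-squares factor model and measure
# how orthogonal the recovered concept directions are.
import torch

torch.manual_seed(0)

n, d = 512, 64            # number of images, embedding dimension (illustrative)
num_concepts = 4

# Synthetic ground truth: orthonormal concept vectors, embeddings built
# additively ("divisibility") with small noise ("stability").
V_true = torch.linalg.qr(torch.randn(d, num_concepts)).Q.T   # (C, d)
labels = (torch.rand(n, num_concepts) > 0.5).float()          # concepts present per image
emb = labels @ V_true + 0.05 * torch.randn(n, d)              # (n, d)

# Linear factor model: least-squares solve labels @ V_hat ≈ emb
V_hat = torch.linalg.lstsq(labels, emb).solution              # (C, d)

# Orthogonality metric: mean |cosine| between distinct concept directions
V_unit = V_hat / V_hat.norm(dim=1, keepdim=True)
cos = V_unit @ V_unit.T
off_diag = cos[~torch.eye(num_concepts, dtype=torch.bool)]
print(f"mean |cos| between concept directions: {off_diag.abs().mean():.3f}")
```

On data that truly satisfies the decomposition, the off-diagonal cosines come out near zero; applied to a real encoder's embeddings, the same metric quantifies how far the model is from the ideal geometry.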

Results & Findings

| Model | Linear factorization (% variance explained) | Orthogonality (avg. cosine) | Compositional test accuracy |
|---|---|---|---|
| CLIP‑ViT‑B/32 | 78% (≈150 effective dims) | 0.12 (near‑orthogonal) | 71% |
| SigLIP‑ViT‑L/14 | 84% (≈210 dims) | 0.08 | 78% |
| DINO‑ViT‑S/14 | 65% (≈120 dims) | 0.19 | 63% |
  • Partial linear factorization: All models exhibit a low‑rank structure where a relatively small subset of dimensions captures most of the variance associated with each visual concept.
  • Near‑orthogonal concept subspaces: The cosine similarity between different concept directions is close to zero, confirming the orthogonality prediction.
  • Predictive power: The stronger the linear‑orthogonal structure, the higher the model’s accuracy on unseen compositional combinations (Pearson r ≈ 0.73 across models).
  • Scaling trend: Larger models tend to move closer to the ideal geometry, suggesting that as we scale up data and parameters, embeddings may converge to the theoretically optimal linear‑orthogonal form.

Practical Implications

  • Model diagnostics: Developers can now probe a vision encoder’s compositional readiness by measuring linear factorization and orthogonality, offering a quick health check before deploying in downstream tasks that require systematic generalization (e.g., robotics, AR/VR scene understanding).
  • Design of training curricula: Introducing explicit compositional objectives (e.g., contrastive losses that encourage orthogonal concept axes) could accelerate convergence to the desired geometry, potentially reducing the amount of data needed for robust generalization.
  • Embedding compression: Since only a low‑rank subspace is needed for each concept, we can design more efficient storage or transmission schemes that retain the orthogonal basis, benefiting edge‑device deployments.
  • Transfer learning: When fine‑tuning a pretrained encoder on a new domain, preserving the orthogonal subspace structure may help retain compositional abilities, guiding regularization strategies (e.g., orthogonal regularizers).
  • Interpretability tools: The linear‑orthogonal decomposition provides a natural way to visualize what each dimension “means” (e.g., “red‑circle axis”), aiding debugging and model explainability for safety‑critical applications.
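One concrete way to implement the "orthogonal regularizer" idea from the fine-tuning bullet is a soft penalty on the Gram matrix of a learned concept bank. The penalty form and weight below are our illustration, not a method from the paper:

```python
# Hypothetical soft-orthogonality penalty on a bank of learned concept
# directions; the exact regularizer is our choice, not the paper's.
import torch

def orthogonality_penalty(concept_vectors: torch.Tensor) -> torch.Tensor:
    """Penalize off-diagonal entries of the Gram matrix of unit-normalized
    concept directions (one direction per row)."""
    V = concept_vectors / concept_vectors.norm(dim=1, keepdim=True)
    gram = V @ V.T
    eye = torch.eye(gram.shape[0], device=gram.device)
    return ((gram - eye) ** 2).sum()

# Usage during fine-tuning: add the penalty to the task loss.
concepts = torch.nn.Parameter(torch.randn(8, 128))
penalty = orthogonality_penalty(concepts)
# total_loss = task_loss + 0.01 * penalty   # the weight is a tunable assumption
```

The penalty is zero exactly when the concept directions are orthonormal, so minimizing it nudges the bank toward the geometry the theory identifies without hard-constraining the optimization.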

Limitations & Future Work

  • Partial compliance: Real‑world models only approximate the ideal linear‑orthogonal geometry; the theory does not yet explain why certain concepts (e.g., textures) deviate more.
  • Dataset scope: Experiments rely on synthetic compositional benchmarks; extending the analysis to natural image datasets with richer semantics is an open step.
  • Training regimes: The study assumes standard supervised or contrastive training; it remains unclear how self‑supervised objectives (e.g., masked autoencoders) affect the geometry.
  • Dynamic concepts: Temporal or relational concepts (e.g., “object moving left of another”) are not covered; future work could explore whether similar geometric constraints hold in video embeddings.
  • Scalability of probing: Extracting per‑concept subspaces for thousands of concepts may become computationally expensive; more scalable factorization techniques are needed.

The authors provide a solid theoretical foundation and a practical toolkit, opening a clear path for the community to build vision systems that generalize compositionally—an essential capability for the next generation of intelligent applications.

Authors

  • Arnas Uselis
  • Andrea Dittadi
  • Seong Joon Oh

Paper Information

  • arXiv ID: 2602.24264v1
  • Categories: cs.CV, cs.LG
  • Published: February 27, 2026
  • PDF: Download PDF