[Paper] Are Object-Centric Representations Better At Compositional Generalization?
Source: arXiv - 2602.16689v1
Overview
The paper investigates whether object‑centric (OC) visual representations—where a scene is broken down into discrete object slots—actually give machines a leg up on compositional generalization (i.e., handling novel combinations of familiar concepts). By building a Visual Question Answering (VQA) benchmark across three synthetic worlds, the authors compare OC models against state‑of‑the‑art dense vision encoders (DINOv2, SigLIP2) under tightly controlled conditions. Their results show that OC representations shine when data, compute, or diversity are limited, while dense encoders only catch up in more forgiving regimes.
Key Contributions
- New VQA benchmark for compositional generalization spanning three controlled visual domains (CLEVRTex, Super‑CLEVR, MOVi‑C).
- Fair head‑to‑head evaluation that equalizes training set size, representation dimensionality, downstream model capacity, and compute budget across OC and dense baselines.
- Empirical evidence that OC models outperform dense encoders on harder compositional splits and are more sample‑efficient.
- Comprehensive analysis of trade‑offs: when dense representations win (easy splits, abundant data) vs. when OC representations dominate (limited data/computation).
- Open‑source release of benchmark data and evaluation scripts, enabling reproducibility and future extensions.
Methodology
Visual Worlds
- CLEVRTex – textured 3D objects on simple backgrounds.
- Super‑CLEVR – CLEVR‑style scenes with richer textures and lighting.
- MOVi‑C – short video clips with moving objects and occlusions.
Task – A VQA format: given an image (or a short clip) and a natural‑language question (e.g., “What color is the object left of the red sphere?”), the model must output the correct answer.
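Each benchmark example can be pictured as a simple record; the field names below are illustrative assumptions, not the benchmark's actual schema:

```python
from dataclasses import dataclass

@dataclass
class VQAItem:
    """One benchmark example (hypothetical field names)."""
    image_path: str   # rendered image, or first frame of a MOVi-C clip
    question: str     # natural-language query about the scene
    answer: str       # single-token ground-truth answer

item = VQAItem(
    image_path="scenes/clevrtex/000042.png",
    question="What color is the object left of the red sphere?",
    answer="blue",
)
```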
Compositional Splits
- Easy split: training and test sets share most object‑property combinations.
- Hard split: test set contains novel pairings of attributes (e.g., a shape‑color combo never seen during training).
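The hard split can be constructed by holding out a subset of attribute pairings entirely; the helper below is a minimal sketch of that idea, not the paper's actual split procedure:

```python
import itertools
import random

def make_compositional_split(shapes, colors, heldout_frac=0.2, seed=0):
    """Partition all shape-color pairings so the test set contains only
    combinations never seen during training (a 'hard' split).
    Hypothetical helper; the paper's split logic may differ."""
    rng = random.Random(seed)
    combos = list(itertools.product(shapes, colors))
    rng.shuffle(combos)
    n_test = max(1, int(len(combos) * heldout_frac))
    return set(combos[n_test:]), set(combos[:n_test])  # (train, test)

train, test = make_compositional_split(
    ["cube", "sphere", "cylinder"], ["red", "blue", "green", "yellow"]
)
# No shape-color combo leaks from train into test.
assert train.isdisjoint(test)
```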
Encoders Compared
- Dense: DINOv2 and SigLIP2 (large, pretrained vision transformers).
- Object‑centric: Slot‑based architectures built on the same backbone but trained to output a set of object slots (e.g., Slot Attention, MONet‑style).
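The slot-based encoders build on mechanisms like Slot Attention; a stripped-down NumPy sketch of one attention iteration (omitting the learned projections and GRU update of the full method) shows how slots compete for input features:

```python
import numpy as np

def slot_attention_step(slots, inputs):
    """One simplified Slot Attention update.
    slots: (K, D) current slot vectors; inputs: (N, D) encoder features."""
    d = slots.shape[1]
    # Similarity between every input feature and every slot.
    logits = inputs @ slots.T / np.sqrt(d)            # (N, K)
    # Softmax over *slots*: inputs are distributed among competing slots.
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn = attn / attn.sum(axis=1, keepdims=True)     # (N, K)
    # Per-slot weighted mean of the inputs (normalized over inputs).
    weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
    return weights.T @ inputs                          # (K, D)

rng = np.random.default_rng(0)
slots = rng.normal(size=(4, 8))    # 4 slots, 8 dims each
inputs = rng.normal(size=(16, 8))  # 16 feature vectors
updated = slot_attention_step(slots, inputs)
```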
Downstream Model
- A lightweight transformer that consumes either the dense embedding or the concatenated object slots and predicts the answer.
- Hyper‑parameters (layers, hidden size) are matched across both encoder families.
Controlled Variables
- Training data size: experiments run with 10 k, 50 k, and 200 k images.
- Representation size: dense vectors and total slot dimensions are kept equal.
- Compute budget: measured in GPU‑hours for both pre‑training and downstream fine‑tuning.
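The representation-size control can be illustrated with assumed shapes: a dense vector of 1024 dims and 8 slots of 128 dims each flatten to the same total size, so a single downstream head applies to either family:

```python
import numpy as np

def to_head_input(dense_vec=None, slots=None):
    """Flatten either representation into one vector so the same
    downstream head can consume both (shapes are illustrative)."""
    if slots is not None:
        return np.asarray(slots).reshape(-1)   # (K, D_slot) -> (K * D_slot,)
    return np.asarray(dense_vec)               # (D_dense,)

dense = np.random.randn(1024)      # dense encoder output
slots = np.random.randn(8, 128)    # 8 slots x 128 dims = 1024 total
```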
Metrics
- Accuracy on VQA answers.
- Sample efficiency: accuracy vs. number of training images.
- Compute efficiency: accuracy vs. GPU‑hours spent on downstream training.
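Sample efficiency as measured here reduces to the training-set size at which a model first reaches a target accuracy; a minimal helper (with made-up accuracy numbers, linearly interpolating between measured points):

```python
def data_to_reach(target_acc, sizes, accs):
    """Smallest training-set size at which accuracy first reaches
    target_acc, interpolating linearly between measured points.
    Returns None if the target is never reached."""
    if accs[0] >= target_acc:
        return sizes[0]
    for i in range(1, len(sizes)):
        if accs[i] >= target_acc:
            frac = (target_acc - accs[i - 1]) / (accs[i] - accs[i - 1])
            return sizes[i - 1] + frac * (sizes[i] - sizes[i - 1])
    return None

sizes = [10_000, 50_000, 200_000]  # the paper's three training sizes
oc_accs = [68, 75, 78]             # illustrative accuracies, not real numbers
```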
Results & Findings
| Setting | Dense (DINOv2 / SigLIP2) | Object‑Centric | Observation |
|---|---|---|---|
| Easy split, 200 k images | 92 % | 90 % | Dense wins, but the gap is under 2 points. |
| Hard split, 200 k images | 71 % | 78 % | OC outperforms dense by ~7 points. |
| Hard split, 10 k images | 55 % | 68 % | OC gains 13 points; dense struggles. |
| Compute‑limited downstream (≤ 2 GPU‑hrs) | 63 % | 74 % | OC needs far less fine‑tuning to reach high performance. |
- OC models excel when the test requires novel attribute combinations, especially under limited data or compute.
- Dense encoders catch up only when the training set is large, diverse, and downstream compute is generous.
- Sample efficiency curves show OC models reach 80 % of their final accuracy with ~30 % of the data needed by dense models.
- Across all three visual worlds, the pattern holds, indicating the result is not domain‑specific.
Practical Implications
- Edge & Mobile AI – Devices with constrained compute can benefit from OC encoders that need far less downstream fine‑tuning to handle new object‑attribute combos (e.g., AR apps that must recognize novel product variations).
- Rapid Prototyping – Teams building VQA or visual reasoning pipelines can achieve strong performance with fewer labeled images by adopting slot‑based backbones.
- Robotics & Autonomous Systems – OC representations naturally align with object‑level planning; the demonstrated compositional robustness means robots can adapt to unseen object configurations without massive data collection.
- Foundation Model Fine‑Tuning – Even when using large pretrained vision transformers, adding an OC head may be a cost‑effective way to boost reasoning on downstream tasks that require compositionality (e.g., visual code assistants, inspection systems).
- Data‑Efficient Curriculum Design – The benchmark suggests that curating diverse attribute combinations is more valuable than sheer volume when the goal is compositional generalization.
Limitations & Future Work
- Synthetic environments – All three worlds are rendered and lack the messiness of real‑world photography (lighting variance, texture noise). Translating findings to natural images remains an open question.
- Slot count fixed – The OC models use a predetermined number of slots; handling scenes with a highly variable object count could require dynamic slot mechanisms.
- Downstream task scope – Only VQA was examined; other reasoning tasks (e.g., captioning, navigation) might exhibit different trade‑offs.
- Scalability of OC pre‑training – Training object‑centric encoders from scratch on massive datasets (e.g., ImageNet‑21k) was not explored; future work could assess whether OC benefits persist at that scale.
- Hybrid approaches – Combining dense embeddings with object slots (e.g., a dual‑stream architecture) could capture the best of both worlds; the authors suggest this as a promising direction.
Authors
- Ferdinand Kapl
- Amir Mohammad Karimi Mamaghan
- Maximilian Seitzer
- Karl Henrik Johansson
- Carsten Marr
- Stefan Bauer
- Andrea Dittadi
Paper Information
- arXiv ID: 2602.16689v1
- Categories: cs.CV, cs.LG
- Published: February 18, 2026