[Paper] Are Object-Centric Representations Better At Compositional Generalization?
Source: arXiv - 2602.16689v1
Overview
The paper investigates whether object‑centric (OC) visual representations—where a scene is broken down into discrete object slots—actually give machines a leg up on compositional generalization (i.e., handling novel combinations of familiar concepts). By building a Visual Question Answering (VQA) benchmark across three synthetic worlds, the authors compare OC models against state‑of‑the‑art dense vision encoders (DINOv2, SigLIP2) under tightly controlled conditions. Their results show that OC representations shine when data, compute, or diversity are limited, while dense encoders only catch up in more forgiving regimes.
Key Contributions
- New VQA benchmark for compositional generalization spanning three controlled visual domains (CLEVRTex, Super‑CLEVR, MOVi‑C).
- Fair head‑to‑head evaluation that equalizes training set size, representation dimensionality, downstream model capacity, and compute budget across OC and dense baselines.
- Empirical evidence that OC models outperform dense encoders on harder compositional splits and are more sample‑efficient.
- Comprehensive analysis of trade‑offs: when dense representations win (easy splits, abundant data) vs. when OC representations dominate (limited data/computation).
- Open‑source release of benchmark data and evaluation scripts, enabling reproducibility and future extensions.
Methodology
Visual Worlds
- CLEVRTex – textured 3D objects on simple backgrounds.
- Super‑CLEVR – CLEVR‑style scenes with richer textures and lighting.
- MOVi‑C – short video clips with moving objects and occlusions.
Task – A VQA format: given an image (or a short clip) and a natural‑language question (e.g., “What color is the object left of the red sphere?”), the model must output the correct answer.
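Each benchmark example can be pictured as a simple record; the field names below are illustrative assumptions, not the benchmark's actual schema:

```python
from dataclasses import dataclass

@dataclass
class VQAItem:
    """One benchmark example (hypothetical field names)."""
    image_path: str   # rendered image, or first frame of a MOVi-C clip
    question: str     # natural-language query about the scene
    answer: str       # single-token ground-truth answer

item = VQAItem(
    image_path="scenes/clevrtex/000042.png",
    question="What color is the object left of the red sphere?",
    answer="blue",
)
```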
Compositional Splits
- Easy split: training and test sets share most object‑property combinations.
- Hard split: test set contains novel pairings of attributes (e.g., a shape‑color combo never seen during training).
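The hard split can be constructed by holding out a subset of attribute pairings entirely; the helper below is a minimal sketch of that idea, not the paper's actual split procedure:

```python
import itertools
import random

def make_compositional_split(shapes, colors, heldout_frac=0.2, seed=0):
    """Partition all shape-color pairings so the test set contains only
    combinations never seen during training (a 'hard' split).
    Hypothetical helper; the paper's split logic may differ."""
    rng = random.Random(seed)
    combos = list(itertools.product(shapes, colors))
    rng.shuffle(combos)
    n_test = max(1, int(len(combos) * heldout_frac))
    return set(combos[n_test:]), set(combos[:n_test])  # (train, test)

train, test = make_compositional_split(
    ["cube", "sphere", "cylinder"], ["red", "blue", "green", "yellow"]
)
# No shape-color combo leaks from train into test.
assert train.isdisjoint(test)
```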
Encoders Compared
- Dense: DINOv2 and SigLIP2 (large, pretrained vision transformers).
- Object‑centric: Slot‑based architectures built on the same backbone but trained to output a set of object slots (e.g., Slot Attention, MONet‑style).
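The slot-based encoders build on mechanisms like Slot Attention; a stripped-down NumPy sketch of one attention iteration (omitting the learned projections and GRU update of the full method) shows how slots compete for input features:

```python
import numpy as np

def slot_attention_step(slots, inputs):
    """One simplified Slot Attention update.
    slots: (K, D) current slot vectors; inputs: (N, D) encoder features."""
    d = slots.shape[1]
    # Similarity between every input feature and every slot.
    logits = inputs @ slots.T / np.sqrt(d)            # (N, K)
    # Softmax over *slots*: inputs are distributed among competing slots.
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn = attn / attn.sum(axis=1, keepdims=True)     # (N, K)
    # Per-slot weighted mean of the inputs (normalized over inputs).
    weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
    return weights.T @ inputs                          # (K, D)

rng = np.random.default_rng(0)
slots = rng.normal(size=(4, 8))    # 4 slots, 8 dims each
inputs = rng.normal(size=(16, 8))  # 16 feature vectors
updated = slot_attention_step(slots, inputs)
```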
Downstream Model
- A lightweight transformer that consumes either the dense embedding or the concatenated object slots and predicts the answer.
- Hyper‑parameters (layers, hidden size) are matched across both encoder families.
Controlled Variables
- Training data size: experiments run with 10 k, 50 k, and 200 k images.
- Representation size: dense vectors and total slot dimensions are kept equal.
- Compute budget: measured in GPU‑hours for both pre‑training and downstream fine‑tuning.
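The representation-size control can be illustrated with assumed shapes: a dense vector of 1024 dims and 8 slots of 128 dims each flatten to the same total size, so a single downstream head applies to either family:

```python
import numpy as np

def to_head_input(dense_vec=None, slots=None):
    """Flatten either representation into one vector so the same
    downstream head can consume both (shapes are illustrative)."""
    if slots is not None:
        return np.asarray(slots).reshape(-1)   # (K, D_slot) -> (K * D_slot,)
    return np.asarray(dense_vec)               # (D_dense,)

dense = np.random.randn(1024)      # dense encoder output
slots = np.random.randn(8, 128)    # 8 slots x 128 dims = 1024 total
```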
Metrics
- Accuracy on VQA answers.
- Sample efficiency: accuracy vs. number of training images.
- Compute efficiency: accuracy vs. GPU‑hours spent on downstream training.
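Sample efficiency as measured here reduces to the training-set size at which a model first reaches a target accuracy; a minimal helper (with made-up accuracy numbers, linearly interpolating between measured points):

```python
def data_to_reach(target_acc, sizes, accs):
    """Smallest training-set size at which accuracy first reaches
    target_acc, interpolating linearly between measured points.
    Returns None if the target is never reached."""
    if accs[0] >= target_acc:
        return sizes[0]
    for i in range(1, len(sizes)):
        if accs[i] >= target_acc:
            frac = (target_acc - accs[i - 1]) / (accs[i] - accs[i - 1])
            return sizes[i - 1] + frac * (sizes[i] - sizes[i - 1])
    return None

sizes = [10_000, 50_000, 200_000]  # the paper's three training sizes
oc_accs = [68, 75, 78]             # illustrative accuracies, not real numbers
```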
Results & Findings
| Setting | Dense (DINOv2 / SigLIP2) | Object‑Centric | Observation |
|---|---|---|---|
| Easy split, 200 k images | 92 % | 90 % | Dense wins, but the gap is under 2 points. |
| Hard split, 200 k images | 71 % | 78 % | OC outperforms dense by ~7 points. |
| Hard split, 10 k images | 55 % | 68 % | OC gains 13 points; dense struggles. |
| Compute‑limited downstream (≤ 2 GPU‑hrs) | 63 % | 74 % | OC needs far less fine‑tuning to reach high performance. |
- OC models excel when the test requires novel attribute combinations, especially under limited data or compute.
- Dense encoders catch up only when the training set is large, diverse, and downstream compute is generous.
- Sample efficiency curves show OC models reach 80 % of their final accuracy with ~30 % of the data needed by dense models.
- Across all three visual worlds, the pattern holds, indicating the result is not domain‑specific.
Practical Implications
- Edge & Mobile AI – Devices with constrained compute can benefit from OC encoders that need far less downstream fine‑tuning to handle new object‑attribute combos (e.g., AR apps that must recognize novel product variations).
- Rapid Prototyping – Teams building VQA or visual reasoning pipelines can achieve strong performance with fewer labeled images by adopting slot‑based backbones.
- Robotics & Autonomous Systems – OC representations naturally align with object‑level planning; the demonstrated compositional robustness means robots can adapt to unseen object configurations without massive data collection.
- Foundation Model Fine‑Tuning – Even when using large pretrained vision transformers, adding an OC head may be a cost‑effective way to boost reasoning on downstream tasks that require compositionality (e.g., visual code assistants, inspection systems).
- Data‑Efficient Curriculum Design – The benchmark suggests that curating diverse attribute combinations is more valuable than sheer volume when the goal is compositional generalization.
Limitations & Future Work
- Synthetic environments – All three worlds are rendered and lack the messiness of real‑world photography (lighting variance, texture noise). Translating findings to natural images remains an open question.
- Slot count fixed – The OC models use a predetermined number of slots; handling scenes with a highly variable object count could require dynamic slot mechanisms.
- Downstream task scope – Only VQA was examined; other reasoning tasks (e.g., captioning, navigation) might exhibit different trade‑offs.
- Scalability of OC pre‑training – Training object‑centric encoders from scratch on massive datasets (e.g., ImageNet‑21k) was not explored; future work could assess whether OC benefits persist at that scale.
- Hybrid approaches – Combining dense embeddings with object slots (e.g., a dual‑stream architecture) could capture the best of both worlds; the authors suggest this as a promising direction.
Authors
- Ferdinand Kapl
- Amir Mohammad Karimi Mamaghan
- Maximilian Seitzer
- Karl Henrik Johansson
- Carsten Marr
- Stefan Bauer
- Andrea Dittadi
Paper Information
- arXiv ID: 2602.16689v1
- Categories: cs.CV, cs.LG
- Published: February 18, 2026