[Paper] Object-Centric Data Synthesis for Category-level Object Detection
Source: arXiv - 2511.23450v1
Overview
Object detection models excel when they have plenty of labeled images, but gathering such data for every new category is expensive—especially for rare or “long‑tailed” classes. This paper tackles the object‑centric data setting, where only a handful of clean, multi‑view photos or 3D models of a new object are available. By synthesizing realistic training images from this limited input, the authors show how to quickly extend detection models to novel categories without the usual data‑collection overhead.
Key Contributions
- Define the object‑centric data scenario and argue why it matters for scaling detection systems to new categories.
- Systematically evaluate four synthesis pipelines:
  - Basic image compositing (cut-paste + background blending).
  - 3D rendering of CAD/mesh models into diverse scenes.
  - Diffusion-based image generation conditioned on object-centric inputs.
  - Hybrid approach combining rendering and diffusion refinement.
- Quantify the impact of contextual realism (clutter, lighting, occlusion) on downstream detection performance.
- Demonstrate sizable mAP gains (up to ~15 % absolute) on real‑world benchmarks when fine‑tuning detectors with synthesized data only.
- Provide an open‑source toolkit for reproducing the pipelines and benchmarking new synthesis methods.
Methodology
- Data Collection – For each novel category the authors gather a small set (≈5–10) of multi‑view RGB images or a 3D mesh. No bounding‑box annotations are required.
- Synthesis Pipelines – four variants; illustrative code sketches for the cut-paste and hybrid steps follow this list:
  - Cut-Paste: Objects are segmented (using off-the-shelf masks) and pasted onto random background images with simple color-matching.
  - 3D Rendering: The mesh is textured (using the multi-view photos) and rendered with a physics-based engine under varied camera poses, lighting, and scene geometry.
  - Diffusion: A text-to-image diffusion model (e.g., Stable Diffusion) is prompted with the object's name and conditioned on the limited views to generate novel scenes.
  - Hybrid: Rendered images are fed back into the diffusion model for style transfer and additional clutter.
- Training – A standard Faster R-CNN / YOLOX detector pre-trained on COCO is fine-tuned on the synthetic images only (see the fine-tuning sketch below). No real annotations of the new class are used.
- Evaluation – The fine‑tuned model is tested on a held‑out real‑world dataset containing the novel categories, measuring mean Average Precision (mAP) and recall.
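The authors' own toolkit implements these pipelines; the sketches below are illustrative reimplementations under stated assumptions, not the paper's code. First, a minimal cut-paste compositing sketch in Python (PIL/NumPy). It assumes an object crop and a binary mask already exist on disk (e.g., produced by an off-the-shelf segmenter), and uses a simple per-channel mean/std transfer as the "simple color-matching"; the paper's exact blending recipe may differ.

```python
import random

import numpy as np
from PIL import Image


def color_match(obj_rgb: np.ndarray, bg_rgb: np.ndarray) -> np.ndarray:
    """Shift the object's per-channel mean/std toward the background's."""
    obj = obj_rgb.astype(np.float32)
    bg = bg_rgb.astype(np.float32)
    out = (obj - obj.mean((0, 1))) / (obj.std((0, 1)) + 1e-6)
    out = out * bg.std((0, 1)) + bg.mean((0, 1))
    return np.clip(out, 0, 255).astype(np.uint8)


def paste_object(obj_path, mask_path, bg_path, scale_range=(0.3, 0.7)):
    """Paste one segmented object onto a background; return image + xyxy box."""
    bg = Image.open(bg_path).convert("RGB")
    obj = Image.open(obj_path).convert("RGB")
    mask = Image.open(mask_path).convert("L")

    # Rescale the object relative to the background's shorter side.
    target = int(min(bg.size) * random.uniform(*scale_range))
    ratio = target / max(obj.size)
    new_size = (max(1, int(obj.width * ratio)), max(1, int(obj.height * ratio)))
    obj, mask = obj.resize(new_size), mask.resize(new_size)

    # Crude color matching against the whole background.
    obj = Image.fromarray(color_match(np.array(obj), np.array(bg)))

    # Random placement; the soft mask handles blending at the object border.
    x = random.randint(0, bg.width - obj.width)
    y = random.randint(0, bg.height - obj.height)
    bg.paste(obj, (x, y), mask)
    return bg, (x, y, x + obj.width, y + obj.height)
```

The returned box can be written directly into a COCO-style annotation file, which is what makes the composites usable as detection training data without manual labeling.

The hybrid step can be approximated with an off-the-shelf image-to-image diffusion pipeline: a rendered view is passed through Stable Diffusion at low strength so the geometry is preserved while texture, lighting, and background clutter become more photorealistic. The checkpoint name, prompt, and strength below are placeholders, not the paper's configuration.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

MODEL_ID = "runwayml/stable-diffusion-v1-5"  # any Stable Diffusion checkpoint

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
).to("cuda")

init = Image.open("render.png").convert("RGB").resize((512, 512))  # rendered view
prompt = "a coffee mug on a cluttered office desk, photorealistic, natural lighting"

# Low strength keeps the rendered geometry; the model adds texture and clutter.
refined = pipe(prompt=prompt, image=init, strength=0.45,
               guidance_scale=7.5, num_inference_steps=40).images[0]
refined.save("hybrid_sample.png")
```

Fine-tuning then follows the usual detection recipe. The sketch below uses a COCO-pretrained Faster R-CNN from torchvision as a stand-in for the paper's detectors; `synthetic_loader`, the class count, and the optimizer settings are hypothetical, and only the box-predictor head is replaced for the novel category.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 2  # background + one novel category

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).train()
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9,
                            weight_decay=1e-4)

for epoch in range(10):
    for images, targets in synthetic_loader:  # hypothetical loader over synthetic images
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss_dict = model(images, targets)  # classification + box-regression losses
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```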
Results & Findings
| Pipeline | mAP (fine-tuned on synthetic data only) | Δ vs. baseline (no new-class data) |
|---|---|---|
| Cut‑Paste | 22.3 % | +6.8 % |
| 3D Rendering | 27.9 % | +12.4 % |
| Diffusion | 25.1 % | +9.6 % |
| Hybrid (Render + Diffusion) | 31.5 % | +15.0 % |
- Context matters: Adding realistic clutter and varied lighting consistently outperforms clean, isolated composites.
- Hybrid approach wins: Rendering provides accurate geometry, while diffusion adds photorealistic texture and background complexity.
- Diminishing returns: Beyond ~20 synthetic images per class, gains plateau, suggesting a modest synthesis budget is sufficient.
- Cross‑category transfer: Models fine‑tuned on synthesized data for one novel class also improve detection of visually similar unseen classes, hinting at category‑level generalization.
Practical Implications
- Rapid onboarding of new products: E‑commerce platforms can generate detection data for a fresh SKU from a few product photos, cutting labeling costs dramatically.
- Robotics & AR: Service robots can learn to recognize new tools or objects on‑the‑fly using a handful of CAD files, without exhaustive scene capture.
- Edge deployment: Since the synthesis pipelines are lightweight (especially cut‑paste and rendering), teams can run them on‑premise to keep proprietary object models private.
- Dataset augmentation: Existing long‑tailed detection datasets can be balanced by synthesizing under‑represented classes, improving fairness and robustness.
Limitations & Future Work
- Domain gap: Even the best synthetic images still differ from real sensor noise, motion blur, and extreme lighting conditions; further domain‑adaptation tricks may be needed.
- Quality of 3D assets: The approach assumes reasonably accurate meshes; poor geometry can hurt detection more than help.
- Scalability of diffusion: High‑resolution diffusion generation is computationally expensive, limiting large‑scale batch synthesis.
- Future directions: The authors propose exploring self‑supervised fine‑tuning on unlabeled real images, integrating neural radiance fields (NeRF) for richer view synthesis, and automating prompt engineering for diffusion models.
Authors
- Vikhyat Agarwal
- Jiayi Cora Guo
- Declan Hoban
- Sissi Zhang
- Nicholas Moran
- Peter Cho
- Srilakshmi Pattabiraman
- Shantanu Joshi
Paper Information
- arXiv ID: 2511.23450v1
- Categories: cs.CV
- Published: November 28, 2025