[Paper] Object-Centric Data Synthesis for Category-level Object Detection
Source: arXiv - 2511.23450v1
Overview
Object detection models excel when they have plenty of labeled images, but gathering such data for every new category is expensive—especially for rare or “long‑tailed” classes. This paper tackles the object‑centric data setting, where only a handful of clean, multi‑view photos or 3D models of a new object are available. By synthesizing realistic training images from this limited input, the authors show how to quickly extend detection models to novel categories without the usual data‑collection overhead.
Key Contributions
- Define the object‑centric data scenario and argue why it matters for scaling detection systems to new categories.
- Systematically evaluate four synthesis pipelines:
  - Basic image compositing (cut-paste + background blending).
  - 3D rendering of CAD/mesh models into diverse scenes.
  - Diffusion-based image generation conditioned on object-centric inputs.
  - Hybrid approach combining rendering and diffusion refinement.
- Quantify the impact of contextual realism (clutter, lighting, occlusion) on downstream detection performance.
- Demonstrate sizable mAP gains (up to ~15 % absolute) on real‑world benchmarks when fine‑tuning detectors with synthesized data only.
- Provide an open‑source toolkit for reproducing the pipelines and benchmarking new synthesis methods.
Methodology
- Data Collection – For each novel category the authors gather a small set (≈5–10) of multi‑view RGB images or a 3D mesh. No bounding‑box annotations are required.
- Synthesis Pipelines – four variants; illustrative code sketches for the cut-paste and hybrid steps follow this list:
  - Cut-Paste: Objects are segmented (using off-the-shelf masks) and pasted onto random background images with simple color-matching.
  - 3D Rendering: The mesh is textured (using the multi-view photos) and rendered with a physics-based engine under varied camera poses, lighting, and scene geometry.
  - Diffusion: A text-to-image diffusion model (e.g., Stable Diffusion) is prompted with the object's name and conditioned on the limited views to generate novel scenes.
  - Hybrid: Rendered images are fed back into the diffusion model for style transfer and additional clutter.
- Training – A standard Faster R-CNN / YOLOX detector pre-trained on COCO is fine-tuned on the synthetic images only (see the fine-tuning sketch below). No real annotations of the new class are used.
- Evaluation – The fine‑tuned model is tested on a held‑out real‑world dataset containing the novel categories, measuring mean Average Precision (mAP) and recall.
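The authors' own toolkit implements these pipelines; the sketches below are illustrative reimplementations under stated assumptions, not the paper's code. First, a minimal cut-paste compositing sketch in Python (PIL/NumPy). It assumes an object crop and a binary mask already exist on disk (e.g., produced by an off-the-shelf segmenter), and uses a simple per-channel mean/std transfer as the "simple color-matching"; the paper's exact blending recipe may differ.

```python
import random

import numpy as np
from PIL import Image


def color_match(obj_rgb: np.ndarray, bg_rgb: np.ndarray) -> np.ndarray:
    """Shift the object's per-channel mean/std toward the background's."""
    obj = obj_rgb.astype(np.float32)
    bg = bg_rgb.astype(np.float32)
    out = (obj - obj.mean((0, 1))) / (obj.std((0, 1)) + 1e-6)
    out = out * bg.std((0, 1)) + bg.mean((0, 1))
    return np.clip(out, 0, 255).astype(np.uint8)


def paste_object(obj_path, mask_path, bg_path, scale_range=(0.3, 0.7)):
    """Paste one segmented object onto a background; return image + xyxy box."""
    bg = Image.open(bg_path).convert("RGB")
    obj = Image.open(obj_path).convert("RGB")
    mask = Image.open(mask_path).convert("L")

    # Rescale the object relative to the background's shorter side.
    target = int(min(bg.size) * random.uniform(*scale_range))
    ratio = target / max(obj.size)
    new_size = (max(1, int(obj.width * ratio)), max(1, int(obj.height * ratio)))
    obj, mask = obj.resize(new_size), mask.resize(new_size)

    # Crude color matching against the whole background.
    obj = Image.fromarray(color_match(np.array(obj), np.array(bg)))

    # Random placement; the soft mask handles blending at the object border.
    x = random.randint(0, bg.width - obj.width)
    y = random.randint(0, bg.height - obj.height)
    bg.paste(obj, (x, y), mask)
    return bg, (x, y, x + obj.width, y + obj.height)
```

The returned box can be written directly into a COCO-style annotation file, which is what makes the composites usable as detection training data without manual labeling.

The hybrid step can be approximated with an off-the-shelf image-to-image diffusion pipeline: a rendered view is passed through Stable Diffusion at low strength so the geometry is preserved while texture, lighting, and background clutter become more photorealistic. The checkpoint name, prompt, and strength below are placeholders, not the paper's configuration.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

MODEL_ID = "runwayml/stable-diffusion-v1-5"  # any Stable Diffusion checkpoint

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
).to("cuda")

init = Image.open("render.png").convert("RGB").resize((512, 512))  # rendered view
prompt = "a coffee mug on a cluttered office desk, photorealistic, natural lighting"

# Low strength keeps the rendered geometry; the model adds texture and clutter.
refined = pipe(prompt=prompt, image=init, strength=0.45,
               guidance_scale=7.5, num_inference_steps=40).images[0]
refined.save("hybrid_sample.png")
```

Fine-tuning then follows the usual detection recipe. The sketch below uses a COCO-pretrained Faster R-CNN from torchvision as a stand-in for the paper's detectors; `synthetic_loader`, the class count, and the optimizer settings are hypothetical, and only the box-predictor head is replaced for the novel category.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 2  # background + one novel category

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).train()
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9,
                            weight_decay=1e-4)

for epoch in range(10):
    for images, targets in synthetic_loader:  # hypothetical loader over synthetic images
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss_dict = model(images, targets)  # classification + box-regression losses
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```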
Results & Findings
| Pipeline | mAP (fine-tuned on synthetic data only) | Δ vs. baseline (no new-class data) |
|---|---|---|
| Cut‑Paste | 22.3 % | +6.8 % |
| 3D Rendering | 27.9 % | +12.4 % |
| Diffusion | 25.1 % | +9.6 % |
| Hybrid (Render + Diffusion) | 31.5 % | +15.0 % |
- Context matters: Adding realistic clutter and varied lighting consistently outperforms clean, isolated composites.
- Hybrid approach wins: Rendering provides accurate geometry, while diffusion adds photorealistic texture and background complexity.
- Diminishing returns: Beyond ~20 synthetic images per class, gains plateau, suggesting a modest synthesis budget is sufficient.
- Cross‑category transfer: Models fine‑tuned on synthesized data for one novel class also improve detection of visually similar unseen classes, hinting at category‑level generalization.
Practical Implications
- Rapid onboarding of new products: E‑commerce platforms can generate detection data for a fresh SKU from a few product photos, cutting labeling costs dramatically.
- Robotics & AR: Service robots can learn to recognize new tools or objects on‑the‑fly using a handful of CAD files, without exhaustive scene capture.
- Edge deployment: Since the synthesis pipelines are lightweight (especially cut‑paste and rendering), teams can run them on‑premise to keep proprietary object models private.
- Dataset augmentation: Existing long‑tailed detection datasets can be balanced by synthesizing under‑represented classes, improving fairness and robustness.
Limitations & Future Work
- Domain gap: Even the best synthetic images still differ from real sensor noise, motion blur, and extreme lighting conditions; further domain‑adaptation tricks may be needed.
- Quality of 3D assets: The approach assumes reasonably accurate meshes; poor geometry can hurt detection more than help.
- Scalability of diffusion: High‑resolution diffusion generation is computationally expensive, limiting large‑scale batch synthesis.
- Future directions: The authors propose exploring self‑supervised fine‑tuning on unlabeled real images, integrating neural radiance fields (NeRF) for richer view synthesis, and automating prompt engineering for diffusion models.
Authors
- Vikhyat Agarwal
- Jiayi Cora Guo
- Declan Hoban
- Sissi Zhang
- Nicholas Moran
- Peter Cho
- Srilakshmi Pattabiraman
- Shantanu Joshi
Paper Information
- arXiv ID: 2511.23450v1
- Categories: cs.CV
- Published: November 28, 2025