[Paper] SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
Source: arXiv - 2602.23359v1
Overview
The paper SeeThrough3D tackles a missing piece in text‑to‑image generation: occlusion awareness. While modern diffusion models can paint photorealistic scenes from a textual prompt and a 2‑D layout, they often ignore the depth ordering of objects, leading to unrealistic overlaps (e.g., a car appearing “in front of” a tree when it should be behind). The authors introduce a 3‑D‑centric pipeline that lets developers specify not only where objects are, but also how they hide behind each other, all while keeping full control over the virtual camera.
Key Contributions
- Occlusion‑aware 3‑D scene representation (OSCR): objects are encoded as translucent 3‑D boxes whose transparency signals hidden geometry.
- Camera‑controlled rendering: a lightweight renderer produces a 2‑D view from any desired viewpoint, giving explicit pose control during generation.
- Visual token injection: the rendered OSCR view is converted into a sequence of visual tokens that condition a pretrained flow‑based text‑to‑image diffusion model.
- Masked self‑attention binding: each object token is tightly coupled to its textual description, preventing attribute mixing across objects.
- Synthetic occlusion‑rich dataset: a large, procedurally generated collection of multi‑object scenes with strong inter‑object occlusions used to train the system.
- Zero‑shot generalization: the model can handle unseen object categories and novel camera angles without retraining.
Methodology
- Scene Encoding – For every object the user supplies a 3‑D bounding box (position, size, orientation) and a textual label. The box is rendered as a semi‑transparent cuboid; the degree of transparency encodes how much of the object is hidden behind others.
- View Synthesis – A simple differentiable renderer projects the translucent boxes onto a 2‑D canvas from a user‑chosen camera pose (azimuth, elevation, distance). The output is a layout image that already contains depth‑consistent occlusion cues.
- Tokenization – The layout image is split into patches and embedded into a sequence of visual tokens (similar to VQ‑VAE or CLIP‑based tokenizers).
- Conditioning the Diffusion Model – These visual tokens are concatenated with the text prompt tokens and fed into a pretrained flow‑based diffusion model. A masked self‑attention layer ensures each object token only attends to its own description, preserving attribute fidelity.
- Training – The whole conditioning pipeline is trained on the synthetic dataset, where ground‑truth images are rendered with perfect occlusion. The diffusion backbone remains frozen; only the token‑injection and attention modules are learned.
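The depth‑ordered rendering at the heart of the scene‑encoding and view‑synthesis steps can be sketched in a few lines. The sketch below is illustrative only: the names (`Box3D`, `camera_position`, `depth_order`) and the painter's‑algorithm ordering are assumptions, not the paper's actual renderer, which is differentiable and also computes per‑box transparency.

```python
import math
from dataclasses import dataclass

# Toy sketch of an OSCR-style scene: each object is a labeled 3-D box,
# a camera on a sphere around the origin views the scene, and boxes are
# depth-sorted far-to-near so nearer boxes paint over (occlude) farther ones.

@dataclass
class Box3D:
    label: str
    center: tuple  # (x, y, z) in world coordinates
    size: tuple    # (w, h, d)

def camera_position(azimuth_deg, elevation_deg, distance):
    """Place the camera on a sphere around the origin (user-chosen pose)."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (distance * math.cos(el) * math.sin(az),
            distance * math.sin(el),
            distance * math.cos(el) * math.cos(az))

def depth_order(boxes, cam):
    """Sort boxes far-to-near; later (nearer) entries are drawn on top."""
    return sorted(boxes, key=lambda b: math.dist(b.center, cam), reverse=True)

# Two objects: a palm tree in front of a car, matching the paper's example.
scene = [Box3D("car", (0.0, 0.0, -2.0), (2.0, 1.0, 4.0)),
         Box3D("palm tree", (0.0, 0.0, 1.0), (0.5, 3.0, 0.5))]
cam = camera_position(azimuth_deg=0, elevation_deg=10, distance=6)
ordered = depth_order(scene, cam)
print([b.label for b in ordered])  # ['car', 'palm tree']
```

Rotating the camera simply changes `camera_position`, which reorders (and re‑occludes) the boxes before the layout is rasterized, which is what gives the pipeline its explicit pose control.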
The result is a system that can take a prompt like “a red sports car behind a palm tree” together with a 3‑D layout and camera spec, and output a photo‑realistic image where the car is correctly hidden behind the tree.
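The masked self‑attention binding described above can be illustrated with a small boolean mask: each object's tokens may attend to tokens of the same object and to shared prompt tokens, but not to other objects' tokens, which is what prevents attribute mixing. The token layout and function name here are assumptions for illustration, not the paper's implementation.

```python
# Sketch of a per-object attention-binding mask.
# token_owner[i] names the object owning token i; None marks a shared
# (global prompt) token that every object may attend to.

def build_binding_mask(token_owner):
    """Return mask[i][j] = True where token i may attend to token j."""
    n = len(token_owner)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            same_object = token_owner[i] == token_owner[j]
            global_token = token_owner[j] is None  # shared prompt context
            mask[i][j] = same_object or global_token
    return mask

# Two global prompt tokens, then tokens belonging to objects 0 and 1.
owners = [None, None, 0, 0, 1, 1]
mask = build_binding_mask(owners)
# Object-0 tokens see globals and object-0 tokens, but never object-1 tokens,
# so "red" in "a red sports car" cannot leak onto the palm tree.
print(mask[2])  # [True, True, True, True, False, False]
```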
Results & Findings
- Quantitative gains: On a held‑out test set, SeeThrough3D reduces occlusion‑related errors (measured by Intersection‑over‑Union of visible regions) by ~30 % compared to state‑of‑the‑art layout‑conditioned diffusion models.
- Qualitative improvement: Visual comparisons show far fewer “floating” objects and more coherent depth cues, especially in crowded scenes with multiple overlapping items.
- Generalization: The model successfully synthesizes scenes containing objects not seen during training (e.g., “a kite” or “a surfboard”) while preserving correct occlusion ordering.
- Camera flexibility: Users can rotate the virtual camera after the layout is defined, and the generated image updates consistently, demonstrating true 3‑D control.
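The visible‑region IoU behind the quantitative comparison can be sketched as below, assuming visibility is given as boolean pixel masks of what remains of each object after occlusion; the paper's exact evaluation protocol may differ.

```python
# Minimal visible-region IoU on flat boolean masks: a model that draws an
# occluded object fully visible scores low against the ground-truth mask.

def visible_iou(pred, gt):
    """Intersection-over-Union between two flat boolean visibility masks."""
    inter = sum(p and g for p, g in zip(pred, gt))
    union = sum(p or g for p, g in zip(pred, gt))
    return inter / union if union else 1.0  # two empty masks match trivially

pred = [1, 1, 1, 0, 0, 0]  # pixels the model left visible for an object
gt   = [1, 1, 0, 0, 0, 1]  # pixels truly visible after correct occlusion
print(round(visible_iou(pred, gt), 3))  # 0.5
```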
Practical Implications
- Game and VR asset pipelines: Designers can script complex scenes (positions, depths, camera angles) and obtain high‑fidelity concept art without manually painting occlusions.
- E‑commerce & AR visualizers: Retailers can place products behind or in front of other items (e.g., a phone on a desk behind a coffee mug) and generate realistic marketing images on the fly.
- Automated storyboard creation: Filmmakers can define scene geometry and let the model render storyboard frames that respect proper depth, saving time on manual layout adjustments.
- Data augmentation for perception models: Synthetic training data with accurate occlusion patterns can improve object detection and depth estimation models, especially for safety‑critical domains like autonomous driving.
Limitations & Future Work
- Synthetic training bias: The model is trained on procedurally generated scenes; real‑world textures, lighting variations, and complex geometry (non‑box shapes) may not be perfectly captured.
- Box‑only geometry: Representing objects as cuboids limits fine‑grained occlusion details (e.g., a tree’s branches). Extending OSCR to mesh‑based or implicit representations could improve realism.
- Scalability of token injection: As scene complexity grows, the number of visual tokens rises, potentially stressing the diffusion model’s context window. Future work may explore hierarchical token compression or sparse attention.
- Interactive editing: The current pipeline is offline; integrating real‑time editing (drag‑and‑drop of objects) would make the system more usable for designers.
Overall, SeeThrough3D pushes text‑to‑image generation a step closer to true 3‑D reasoning, opening new doors for developers who need precise control over scene composition and camera viewpoint.
Authors
- Vaibhav Agrawal
- Rishubh Parihar
- Pradhaan Bhat
- Ravi Kiran Sarvadevabhatla
- R. Venkatesh Babu
Paper Information
- arXiv ID: 2602.23359v1
- Categories: cs.CV, cs.AI
- Published: February 26, 2026