[Paper] SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
Source: arXiv - 2602.23359v1
Overview
The paper SeeThrough3D tackles a missing piece in text‑to‑image generation: occlusion awareness. While modern diffusion models can paint photorealistic scenes from a textual prompt and a 2‑D layout, they often ignore the depth ordering of objects, leading to unrealistic overlaps (e.g., a car appearing “in front of” a tree when it should be behind). The authors introduce a 3‑D‑centric pipeline that lets developers specify not only where objects are, but also how they hide behind each other, all while keeping full control over the virtual camera.
Key Contributions
- Occlusion‑aware 3‑D scene representation (OSCR): objects are encoded as translucent 3‑D boxes whose transparency signals hidden geometry.
- Camera‑controlled rendering: a lightweight renderer produces a 2‑D view from any desired viewpoint, giving explicit pose control during generation.
- Visual token injection: the rendered OSCR view is converted into a sequence of visual tokens that condition a pretrained flow‑based text‑to‑image diffusion model.
- Masked self‑attention binding: each object token is tightly coupled to its textual description, preventing attribute mixing across objects.
- Synthetic occlusion‑rich dataset: a large, procedurally generated collection of multi‑object scenes with strong inter‑object occlusions used to train the system.
- Zero‑shot generalization: the model can handle unseen object categories and novel camera angles without retraining.
Methodology
- Scene Encoding – For every object the user supplies a 3‑D bounding box (position, size, orientation) and a textual label. The box is rendered as a semi‑transparent cuboid; the degree of transparency encodes how much of the object is hidden behind others.
- View Synthesis – A simple differentiable renderer projects the translucent boxes onto a 2‑D canvas from a user‑chosen camera pose (azimuth, elevation, distance). The output is a layout image that already contains depth‑consistent occlusion cues.
- Tokenization – The layout image is split into patches and embedded into a sequence of visual tokens (similar to VQ‑VAE or CLIP‑based tokenizers).
- Conditioning the Diffusion Model – These visual tokens are concatenated with the text prompt tokens and fed into a pretrained flow‑based diffusion model. A masked self‑attention layer ensures each object token only attends to its own description, preserving attribute fidelity.
- Training – The whole conditioning pipeline is trained on the synthetic dataset, where ground‑truth images are rendered with perfect occlusion. The diffusion backbone remains frozen; only the token‑injection and attention modules are learned.
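The depth‑ordered rendering at the heart of the scene‑encoding and view‑synthesis steps can be sketched in a few lines. The sketch below is illustrative only: the names (`Box3D`, `camera_position`, `depth_order`) and the painter's‑algorithm ordering are assumptions, not the paper's actual renderer, which is differentiable and also computes per‑box transparency.

```python
import math
from dataclasses import dataclass

# Toy sketch of an OSCR-style scene: each object is a labeled 3-D box,
# a camera on a sphere around the origin views the scene, and boxes are
# depth-sorted far-to-near so nearer boxes paint over (occlude) farther ones.

@dataclass
class Box3D:
    label: str
    center: tuple  # (x, y, z) in world coordinates
    size: tuple    # (w, h, d)

def camera_position(azimuth_deg, elevation_deg, distance):
    """Place the camera on a sphere around the origin (user-chosen pose)."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (distance * math.cos(el) * math.sin(az),
            distance * math.sin(el),
            distance * math.cos(el) * math.cos(az))

def depth_order(boxes, cam):
    """Sort boxes far-to-near; later (nearer) entries are drawn on top."""
    return sorted(boxes, key=lambda b: math.dist(b.center, cam), reverse=True)

# Two objects: a palm tree in front of a car, matching the paper's example.
scene = [Box3D("car", (0.0, 0.0, -2.0), (2.0, 1.0, 4.0)),
         Box3D("palm tree", (0.0, 0.0, 1.0), (0.5, 3.0, 0.5))]
cam = camera_position(azimuth_deg=0, elevation_deg=10, distance=6)
ordered = depth_order(scene, cam)
print([b.label for b in ordered])  # ['car', 'palm tree']
```

Rotating the camera simply changes `camera_position`, which reorders (and re‑occludes) the boxes before the layout is rasterized, which is what gives the pipeline its explicit pose control.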
The result is a system that can take a prompt like “a red sports car behind a palm tree” together with a 3‑D layout and camera spec, and output a photo‑realistic image where the car is correctly hidden behind the tree.
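The masked self‑attention binding described above can be illustrated with a small boolean mask: each object's tokens may attend to tokens of the same object and to shared prompt tokens, but not to other objects' tokens, which is what prevents attribute mixing. The token layout and function name here are assumptions for illustration, not the paper's implementation.

```python
# Sketch of a per-object attention-binding mask.
# token_owner[i] names the object owning token i; None marks a shared
# (global prompt) token that every object may attend to.

def build_binding_mask(token_owner):
    """Return mask[i][j] = True where token i may attend to token j."""
    n = len(token_owner)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            same_object = token_owner[i] == token_owner[j]
            global_token = token_owner[j] is None  # shared prompt context
            mask[i][j] = same_object or global_token
    return mask

# Two global prompt tokens, then tokens belonging to objects 0 and 1.
owners = [None, None, 0, 0, 1, 1]
mask = build_binding_mask(owners)
# Object-0 tokens see globals and object-0 tokens, but never object-1 tokens,
# so "red" in "a red sports car" cannot leak onto the palm tree.
print(mask[2])  # [True, True, True, True, False, False]
```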
Results & Findings
- Quantitative gains: On a held‑out test set, SeeThrough3D reduces occlusion‑related errors (measured by Intersection‑over‑Union of visible regions) by ~30 % compared to state‑of‑the‑art layout‑conditioned diffusion models.
- Qualitative improvement: Visual comparisons show far fewer “floating” objects and more coherent depth cues, especially in crowded scenes with multiple overlapping items.
- Generalization: The model successfully synthesizes scenes containing objects not seen during training (e.g., “a kite” or “a surfboard”) while preserving correct occlusion ordering.
- Camera flexibility: Users can rotate the virtual camera after the layout is defined, and the generated image updates consistently, demonstrating true 3‑D control.
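The visible‑region IoU behind the quantitative comparison can be sketched as below, assuming visibility is given as boolean pixel masks of what remains of each object after occlusion; the paper's exact evaluation protocol may differ.

```python
# Minimal visible-region IoU on flat boolean masks: a model that draws an
# occluded object fully visible scores low against the ground-truth mask.

def visible_iou(pred, gt):
    """Intersection-over-Union between two flat boolean visibility masks."""
    inter = sum(p and g for p, g in zip(pred, gt))
    union = sum(p or g for p, g in zip(pred, gt))
    return inter / union if union else 1.0  # two empty masks match trivially

pred = [1, 1, 1, 0, 0, 0]  # pixels the model left visible for an object
gt   = [1, 1, 0, 0, 0, 1]  # pixels truly visible after correct occlusion
print(round(visible_iou(pred, gt), 3))  # 0.5
```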
Practical Implications
- Game and VR asset pipelines: Designers can script complex scenes (positions, depths, camera angles) and obtain high‑fidelity concept art without manually painting occlusions.
- E‑commerce & AR visualizers: Retailers can place products behind or in front of other items (e.g., a phone on a desk behind a coffee mug) and generate realistic marketing images on the fly.
- Automated storyboard creation: Filmmakers can define scene geometry and let the model render storyboard frames that respect proper depth, saving time on manual layout adjustments.
- Data augmentation for perception models: Synthetic training data with accurate occlusion patterns can improve object detection and depth estimation models, especially for safety‑critical domains like autonomous driving.
Limitations & Future Work
- Synthetic training bias: The model is trained on procedurally generated scenes; real‑world textures, lighting variations, and complex geometry (non‑box shapes) may not be perfectly captured.
- Box‑only geometry: Representing objects as cuboids limits fine‑grained occlusion details (e.g., a tree’s branches). Extending OSCR to mesh‑based or implicit representations could improve realism.
- Scalability of token injection: As scene complexity grows, the number of visual tokens rises, potentially stressing the diffusion model’s context window. Future work may explore hierarchical token compression or sparse attention.
- Interactive editing: The current pipeline is offline; integrating real‑time editing (drag‑and‑drop of objects) would make the system more usable for designers.
Overall, SeeThrough3D pushes text‑to‑image generation a step closer to true 3‑D reasoning, opening new doors for developers who need precise control over scene composition and camera viewpoint.
Authors
- Vaibhav Agrawal
- Rishubh Parihar
- Pradhaan Bhat
- Ravi Kiran Sarvadevabhatla
- R. Venkatesh Babu
Paper Information
- arXiv ID: 2602.23359v1
- Categories: cs.CV, cs.AI
- Published: February 26, 2026