[Paper] SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation

Published: February 26, 2026
5 min read
Source: arXiv - 2602.23359v1

Overview

The paper SeeThrough3D tackles a missing piece in text‑to‑image generation: occlusion awareness. While modern diffusion models can paint photorealistic scenes from a textual prompt and a 2‑D layout, they often ignore the depth ordering of objects, leading to unrealistic overlaps (e.g., a car appearing “in front of” a tree when it should be behind). The authors introduce a 3‑D‑centric pipeline that lets developers specify not only where objects are, but also how they hide behind each other, all while keeping full control over the virtual camera.

Key Contributions

  • Occlusion‑aware 3‑D scene representation (OSCR): objects are encoded as translucent 3‑D boxes whose transparency signals hidden geometry.
  • Camera‑controlled rendering: a lightweight renderer produces a 2‑D view from any desired viewpoint, giving explicit pose control during generation.
  • Visual token injection: the rendered OSCR view is converted into a sequence of visual tokens that condition a pretrained flow‑based text‑to‑image diffusion model.
  • Masked self‑attention binding: each object token is tightly coupled to its textual description, preventing attribute mixing across objects.
  • Synthetic occlusion‑rich dataset: a large, procedurally generated collection of multi‑object scenes with strong inter‑object occlusions used to train the system.
  • Zero‑shot generalization: the model can handle unseen object categories and novel camera angles without retraining.

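The "visual token injection" contribution can be illustrated with a minimal ViT-style patch tokenizer. This is a sketch under assumptions: the paper mentions VQ-VAE or CLIP-based tokenizers, so the linear patch projection below (and the `patchify`/`tokenize` names) is an illustrative stand-in, not the authors' actual module.

```python
import numpy as np

def patchify(image, patch=8):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    # Group pixels into (row-block, row-in-patch, col-block, col-in-patch, channel).
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    # Bring the two block axes to the front, then flatten each patch.
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch * patch * c)

def tokenize(image, embed, patch=8):
    """Project each flattened patch into the token embedding space."""
    return patchify(image, patch) @ embed
```

A 32x32x3 layout image with 8x8 patches yields 16 visual tokens, each a row of the projected matrix, ready to be concatenated with the text prompt tokens.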
Methodology

  1. Scene Encoding – For every object the user supplies a 3‑D bounding box (position, size, orientation) and a textual label. The box is rendered as a semi‑transparent cuboid; the degree of transparency encodes how much of the object is hidden behind others.
  2. View Synthesis – A simple differentiable renderer projects the translucent boxes onto a 2‑D canvas from a user‑chosen camera pose (azimuth, elevation, distance). The output is a layout image that already contains depth‑consistent occlusion cues.
  3. Tokenization – The layout image is split into patches and embedded into a sequence of visual tokens (similar to VQ‑VAE or CLIP‑based tokenizers).
  4. Conditioning the Diffusion Model – These visual tokens are concatenated with the text prompt tokens and fed into a pretrained flow‑based diffusion model. A masked self‑attention layer ensures each object token only attends to its own description, preserving attribute fidelity.
  5. Training – The whole conditioning pipeline is trained on the synthetic dataset, where ground‑truth images are rendered with perfect occlusion. The diffusion backbone remains frozen; only the token‑injection and attention modules are learned.

The result is a system that can take a prompt like “a red sports car behind a palm tree” together with a 3‑D layout and camera spec, and output a photo‑realistic image where the car is correctly hidden behind the tree.
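The masked self-attention binding of step 4 amounts to a boolean attention mask that lets each object's visual tokens attend only to that object's text tokens (plus the global prompt). The token-layout convention and function name below are assumptions for illustration; the paper's exact masking scheme may differ.

```python
import numpy as np

def binding_mask(obj_vis_counts, obj_txt_counts, n_global_txt):
    """Return a boolean mask M[q, k] = True where query q may attend to key k.

    Assumed token layout along both axes:
    [global text | per-object text | per-object visual], in object order.
    """
    n_txt = n_global_txt + sum(obj_txt_counts)
    n_total = n_txt + sum(obj_vis_counts)
    mask = np.zeros((n_total, n_total), dtype=bool)
    mask[:, :n_global_txt] = True   # every token sees the global prompt
    mask[:n_txt, :n_txt] = True     # text tokens attend among themselves
    txt_off, vis_off = n_global_txt, n_txt
    for t, v in zip(obj_txt_counts, obj_vis_counts):
        # Visual tokens of object i are bound to object i's text tokens only,
        # preventing attribute mixing across objects.
        mask[vis_off:vis_off + v, txt_off:txt_off + t] = True
        mask[vis_off:vis_off + v, vis_off:vis_off + v] = True
        txt_off += t
        vis_off += v
    return mask
```

Such a mask can be passed to a standard attention layer so that, e.g., "red" in "a red sports car" never leaks onto the palm tree's tokens.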

Results & Findings

  • Quantitative gains: On a held‑out test set, SeeThrough3D reduces occlusion‑related errors (measured by Intersection‑over‑Union of visible regions) by ~30 % compared to state‑of‑the‑art layout‑conditioned diffusion models.
  • Qualitative improvement: Visual comparisons show far fewer “floating” objects and more coherent depth cues, especially in crowded scenes with multiple overlapping items.
  • Generalization: The model successfully synthesizes scenes containing objects not seen during training (e.g., “a kite” or “a surfboard”) while preserving correct occlusion ordering.
  • Camera flexibility: Users can rotate the virtual camera after the layout is defined, and the generated image updates consistently, demonstrating true 3‑D control.

Practical Implications

  • Game and VR asset pipelines: Designers can script complex scenes (positions, depths, camera angles) and obtain high‑fidelity concept art without manually painting occlusions.
  • E‑commerce & AR visualizers: Retailers can place products behind or in front of other items (e.g., a phone on a desk behind a coffee mug) and generate realistic marketing images on‑the‑fly.
  • Automated storyboard creation: Filmmakers can define scene geometry and let the model render storyboard frames that respect proper depth, saving time on manual layout adjustments.
  • Data augmentation for perception models: Synthetic training data with accurate occlusion patterns can improve object detection and depth estimation models, especially for safety‑critical domains like autonomous driving.

Limitations & Future Work

  • Synthetic training bias: The model is trained on procedurally generated scenes; real‑world textures, lighting variations, and complex geometry (non‑box shapes) may not be perfectly captured.
  • Box‑only geometry: Representing objects as cuboids limits fine‑grained occlusion details (e.g., a tree’s branches). Extending OSCR to mesh‑based or implicit representations could improve realism.
  • Scalability of token injection: As scene complexity grows, the number of visual tokens rises, potentially stressing the diffusion model’s context window. Future work may explore hierarchical token compression or sparse attention.
  • Interactive editing: The current pipeline runs offline; integrating real‑time editing (drag‑and‑drop of objects) would make the system more usable for designers.

Overall, SeeThrough3D pushes text‑to‑image generation a step closer to true 3‑D reasoning, opening new doors for developers who need precise control over scene composition and camera viewpoint.

Authors

  • Vaibhav Agrawal
  • Rishubh Parihar
  • Pradhaan Bhat
  • Ravi Kiran Sarvadevabhatla
  • R. Venkatesh Babu

Paper Information

  • arXiv ID: 2602.23359v1
  • Categories: cs.CV, cs.AI
  • Published: February 26, 2026