[Paper] Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation

Published: December 29, 2025, 01:59 PM EST
4 min read
Source: arXiv - 2512.23705v1

Overview

Transparent and reflective objects—think glass mugs, polished metal tools, or clear plastic containers—have long been a nightmare for computer‑vision systems. The new paper shows that modern video diffusion models, which are already good at generating realistic transparent effects, can be repurposed to understand them. By training a lightweight adapter on a massive synthetic video dataset, the authors achieve state‑of‑the‑art depth and surface‑normal estimates for transparent scenes, even on real‑world videos, and demonstrate tangible gains in robotic grasping.

Key Contributions

  • TransPhy3D dataset: 11 k high‑fidelity synthetic video sequences of transparent/reflective objects rendered with physically‑based ray tracing (RGB, depth, and normals).
  • DKT (Diffusion‑Knows‑Transparency) model: A video‑to‑video translation network built on a pretrained video diffusion backbone (DiT) with tiny LoRA adapters, jointly trained on synthetic and existing datasets.
  • Zero‑shot SOTA performance: Outperforms image‑ and video‑based baselines on benchmarks such as ClearPose, DREDS (CatKnown/CatNovel), and the held‑out TransPhy3D test set.
  • Temporal consistency: The model produces smooth depth/normal streams for arbitrarily long videos, a common failure point for frame‑wise methods.
  • Real‑world impact: Integrated into a robotic grasping pipeline, DKT’s depth predictions raise success rates on translucent, reflective, and diffuse objects compared with prior estimators.
  • Efficient inference: A compact 1.3 B‑parameter version runs at ~0.17 s per frame, making it feasible for on‑robot deployment.

Methodology

  1. Synthetic data generation – Using Blender’s Cycles renderer and OptiX denoising, the authors built a library of static and procedural 3D assets (cups, bottles, metal parts, etc.) and applied glass, plastic, and metal shaders. Each scene yields synchronized RGB, depth, and normal maps; a minimal rendering-setup sketch follows this list.
  2. Video diffusion backbone – They start from a large pretrained video diffusion transformer (DiT) whose generative training on vast amounts of natural video has already given it strong priors about how light interacts with transparent and reflective surfaces.
  3. LoRA adapters for translation – Lightweight Low‑Rank Adaptation (LoRA) modules are inserted into the diffusion model’s attention layers. During training, the RGB frames and noisy depth latents are concatenated and fed through the backbone, teaching the network to map video frames to depth (or normal) streams; see the adapter sketch after this list.
  4. Joint training – The model is fine‑tuned on both the new TransPhy3D corpus and existing synthetic frame‑wise datasets, encouraging it to generalize across domains while preserving temporal coherence.
  5. Inference – At test time, an input video is passed through the adapted diffusion model, which directly outputs a depth (or normal) video of the same length, without any post‑processing or per‑frame optimization; a conceptual sampling loop is also sketched below.
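
For readers who want to set up a similar data pipeline, the snippet below is a minimal Blender‑Python sketch of the rendering configuration in step 1: Cycles with OptiX denoising plus depth and normal passes alongside RGB. Scene assembly, the compositor file outputs, and the object name `cup` are illustrative assumptions, not the authors' actual script.

```python
# Minimal Blender-Python sketch: Cycles + OptiX denoising, with depth/normal passes.
# Scene/asset assembly and compositor file outputs are omitted.
import bpy

scene = bpy.context.scene
scene.render.engine = 'CYCLES'
scene.cycles.use_denoising = True
scene.cycles.denoiser = 'OPTIX'              # requires an NVIDIA GPU

# Emit ground-truth depth and surface normals together with the RGB frames.
view_layer = bpy.context.view_layer
view_layer.use_pass_z = True                 # depth pass
view_layer.use_pass_normal = True            # normal pass

# Apply a simple glass shader to one asset (placeholder object name).
obj = bpy.data.objects["cup"]                # hypothetical asset
mat = bpy.data.materials.new(name="Glass")
mat.use_nodes = True
nodes, links = mat.node_tree.nodes, mat.node_tree.links
glass = nodes.new("ShaderNodeBsdfGlass")
links.new(glass.outputs["BSDF"], nodes["Material Output"].inputs["Surface"])
obj.data.materials.append(mat)

bpy.ops.render.render(animation=True)        # render the full video sequence
```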
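The sketch below illustrates steps 3–4: freezing the DiT backbone, wrapping its attention projections with low-rank adapters, and training on channel-concatenated RGB and noisy depth latents. The projection names, latent layout, scheduler interface, and epsilon-prediction loss are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of LoRA adaptation and a single training step (steps 3-4).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Wraps a frozen linear projection with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # keep the backbone frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                   # low-rank update starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

def add_lora(dit: nn.Module, rank: int = 16) -> nn.Module:
    """Replace attention projections with LoRA wrappers (projection names are assumed)."""
    for _, module in list(dit.named_modules()):
        for proj in ("to_q", "to_k", "to_v", "to_out"):
            child = getattr(module, proj, None)
            if isinstance(child, nn.Linear):
                setattr(module, proj, LoRALinear(child, rank))
    return dit

def training_step(dit, rgb_latents, depth_latents, timesteps, noise_scheduler):
    """Condition on clean RGB latents, learn to denoise the depth latents.

    rgb_latents, depth_latents: (batch, frames, channels, height, width) from a video VAE.
    noise_scheduler: a diffusers-style scheduler exposing add_noise().
    """
    noise = torch.randn_like(depth_latents)
    noisy_depth = noise_scheduler.add_noise(depth_latents, noise, timesteps)
    model_in = torch.cat([rgb_latents, noisy_depth], dim=2)   # concatenate along channels
    pred = dit(model_in, timesteps)                           # hypothetical call signature
    return F.mse_loss(pred, noise)                            # epsilon-prediction objective
```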
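And a conceptual version of the inference pass in step 5: the depth latent is denoised from pure noise while the clean RGB latents act as conditioning. The diffusers-style scheduler interface, step count, and matching latent shapes are assumptions.

```python
# Conceptual sampling loop for step 5 (assumed scheduler interface).
import torch

@torch.no_grad()
def sample_depth_video(dit, rgb_latents, scheduler, num_steps: int = 25):
    """rgb_latents: (batch, frames, channels, height, width) latents of the RGB clip."""
    depth_latents = torch.randn_like(rgb_latents)        # start from Gaussian noise
    scheduler.set_timesteps(num_steps)
    for t in scheduler.timesteps:
        model_in = torch.cat([rgb_latents, depth_latents], dim=2)
        noise_pred = dit(model_in, t)
        depth_latents = scheduler.step(noise_pred, t, depth_latents).prev_sample
    return depth_latents   # decode with the video VAE to obtain the depth video
```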

Results & Findings

| Benchmark | Metric (lower is better) | DKT | Improvement vs. best prior |
| --- | --- | --- | --- |
| ClearPose (depth) | RMSE | 0.12 | +23 % |
| DREDS (CatKnown) | Abs‑Rel | 0.08 | +19 % |
| DREDS (CatNovel) | Abs‑Rel | 0.09 | +21 % |
| TransPhy3D‑Test (depth) | MAE | 0.07 | +25 % |
| ClearPose (normals) | Angular error | 6.3° | +18 % |

  • Temporal smoothness: DKT reduces frame‑to‑frame depth jitter by >30 % compared with the strongest video baseline (a simple jitter proxy is sketched after this list).
  • Real‑world grasping: In a pick‑and‑place experiment with a 7‑DoF arm, success rates on transparent objects rose from 62 % (previous estimator) to 81 % using DKT’s depth.
  • Speed: The 1.3 B model processes a 30‑fps video at ~6 FPS on a single RTX 4090, suitable for many robotic loops.
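
As a rough way to reproduce the smoothness comparison on your own outputs, the proxy below measures mean absolute frame-to-frame depth change; the paper's exact jitter metric may differ.

```python
# Generic temporal-jitter proxy: mean absolute frame-to-frame depth change.
import numpy as np

def temporal_jitter(depth_video: np.ndarray, valid_mask=None) -> float:
    """depth_video: (T, H, W) per-frame depth maps; valid_mask: optional (T, H, W) bools."""
    diffs = np.abs(np.diff(depth_video, axis=0))      # (T-1, H, W) per-pixel changes
    if valid_mask is not None:
        pair_mask = valid_mask[1:] & valid_mask[:-1]  # pixels valid in both adjacent frames
        return float(diffs[pair_mask].mean())
    return float(diffs.mean())
```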

Practical Implications

  • Robotics & manipulation – Reliable depth on glass or polished metal enables robots to handle labware, kitchenware, and industrial parts without costly tactile sensors.
  • AR/VR and mixed reality – Accurate surface normals for transparent objects improve realistic rendering of reflections and refractions in head‑mounted displays.
  • Autonomous inspection – Drones or inspection bots can now generate consistent 3D maps of glass façades or reflective machinery surfaces.
  • Low‑cost perception – Because the model is fine‑tuned from a publicly available diffusion checkpoint, developers can obtain high‑quality depth without collecting labeled transparent‑object datasets.
  • Plug‑and‑play – The video‑to‑video translation interface means existing perception pipelines can swap in DKT with minimal code changes: just feed in the RGB video and read out the depth stream (an integration sketch follows this list).
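
A minimal illustration of that plug‑and‑play usage, assuming a hypothetical `estimate_depth` call standing in for DKT's video‑to‑video interface and standard pinhole intrinsics for the downstream grasp planner:

```python
# Illustrative integration sketch: video in, depth out, point cloud to the planner.
# `estimate_depth` and `grasp_planner` are hypothetical names for existing components.
import numpy as np

def depth_to_points(depth: np.ndarray, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Back-project an (H, W) metric depth map into an (H*W, 3) camera-frame point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))    # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# depth_video = estimate_depth(rgb_video)             # (T, H, W), hypothetical call
# points = depth_to_points(depth_video[-1], fx, fy, cx, cy)
# grasp_planner.plan(points)                          # existing downstream consumer
```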

Limitations & Future Work

  • Synthetic‑to‑real gap – Although zero‑shot performance is strong, extreme lighting conditions (e.g., strong back‑lighting) still cause occasional failures.
  • Material diversity – The current asset bank focuses on common glass, plastic, and metal; exotic materials like frosted glass or anisotropic metals are not covered.
  • Scalability to ultra‑high‑resolution video – The 1.3 B model runs comfortably at 720p; scaling to 4K would need further optimization or model pruning.
  • Future directions suggested by the authors include expanding the synthetic corpus with more varied illumination, integrating multi‑modal cues (e.g., polarization), and exploring end‑to‑end training that jointly optimizes depth, normals, and downstream control policies.

Authors

  • Shaocong Xu
  • Songlin Wei
  • Qizhe Wei
  • Zheng Geng
  • Hong Li
  • Licheng Shen
  • Qianpu Sun
  • Shu Han
  • Bin Ma
  • Bohan Li
  • Chongjie Ye
  • Yuhang Zheng
  • Nan Wang
  • Saining Zhang
  • Hao Zhao

Paper Information

  • arXiv ID: 2512.23705v1
  • Categories: cs.CV
  • Published: December 29, 2025