[Paper] IDT: A Physically Grounded Transformer for Feed-Forward Multi-View Intrinsic Decomposition

Published: December 29, 2025 at 01:24 PM EST
4 min read

Source: arXiv - 2512.23667v1

Overview

The paper introduces Intrinsic Decomposition Transformer (IDT), a feed‑forward neural architecture that can split a set of multi‑view RGB images into physically meaningful components—diffuse reflectance, diffuse shading, and specular shading—in a single forward pass. By using transformer‑style attention across views, IDT delivers consistent intrinsic maps without the costly iterative sampling that diffusion‑based methods require, making multi‑view intrinsic decomposition practical for real‑world pipelines.

Key Contributions

  • Transformer‑based multi‑view reasoning: Jointly processes an arbitrary number of input views with self‑attention, enforcing cross‑view consistency.
  • Physically grounded factorization: Explicitly models the image formation equation I = R·S_d + S_s, separating Lambertian (diffuse) from non‑Lambertian (specular) transport (a short recombination sketch follows this list).
  • Feed‑forward design: Eliminates iterative generative steps, enabling real‑time inference on typical GPU hardware.
  • Improved visual quality: Produces cleaner diffuse albedo, smoother shading, and more isolated specular highlights compared with prior single‑view and multi‑view baselines.
  • Extensive evaluation: Demonstrates superior quantitative metrics and qualitative results on both synthetic benchmark datasets and real‑world captures.
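
To make the factorization concrete, here is a minimal sketch of how the three predicted maps recombine under I = R·S_d + S_s. This is not the authors' code; the tensor shapes and the single‑channel shading assumption are illustrative.

```python
import torch

def reconstruct_image(reflectance: torch.Tensor,
                      diffuse_shading: torch.Tensor,
                      specular_shading: torch.Tensor) -> torch.Tensor:
    """Recombine predicted intrinsic maps via I = R * S_d + S_s.

    Assumed (illustrative) shapes: reflectance is (B, V, 3, H, W);
    the two shading maps are (B, V, 1, H, W) and broadcast over the
    color channels during multiplication and addition.
    """
    return reflectance * diffuse_shading + specular_shading
```

Because this reconstruction is compared against the input image, the same expression doubles as the implicit self‑supervision signal described in the methodology below.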

Methodology

  1. Input handling: A variable‑length list of RGB images captured from different camera poses is fed into a shared CNN encoder that extracts per‑pixel feature maps.
  2. Cross‑view attention: The feature maps are flattened into tokens and passed through a standard transformer encoder. Self‑attention lets each token “see” information from all other views, allowing the network to learn view‑invariant material cues while preserving view‑dependent lighting cues.
  3. Physically informed decoder: The transformer output is split into three branches, each decoded by a lightweight CNN head to predict:
    • Diffuse reflectance (R) – the intrinsic color of the surface.
    • Diffuse shading (S_d) – illumination that follows Lambert’s cosine law.
    • Specular shading (S_s) – view‑dependent highlights.
      The three outputs are combined using the image formation model I = R·S_d + S_s to reconstruct the input, providing an implicit self‑supervision signal (see the forward‑pass sketch after this list).
  4. Losses (see the loss sketch after this list):
    • Reconstruction loss (L1 between reconstructed and original images).
    • Reflectance consistency loss across views (encourages identical albedo for the same surface point).
    • Shading smoothness and specular sparsity regularizers to enforce physically plausible behavior.
  5. Training: The network is trained end‑to‑end on synthetic datasets where ground‑truth intrinsic components are available, then fine‑tuned on real captures using the self‑supervised reconstruction loss.
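
To tie steps 1–3 together, the following PyTorch‑style sketch shows one plausible wiring of the described pipeline: a shared CNN encoder, a standard transformer encoder attending across tokens from all views, and three lightweight heads recombined through the image formation model. The layer sizes, encoder depth, and output activations are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class IDTSketch(nn.Module):
    """Illustrative skeleton of the described pipeline: shared CNN encoder,
    cross-view transformer, and three lightweight decoding heads.
    Dimensions and layer counts are placeholders, not the paper's values."""

    def __init__(self, feat_dim: int = 256, num_layers: int = 6, num_heads: int = 8):
        super().__init__()
        # Shared per-view CNN encoder (step 1); a real model would use a deeper backbone.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=4, stride=4),
            nn.GELU(),
        )
        # Standard transformer encoder over tokens from all views (step 2).
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

        # Three lightweight heads (step 3): reflectance (3 ch), diffuse and
        # specular shading (1 ch each, assumed grayscale for illustration).
        def head(out_ch: int) -> nn.Module:
            return nn.Sequential(
                nn.ConvTranspose2d(feat_dim, 64, kernel_size=4, stride=4),
                nn.GELU(),
                nn.Conv2d(64, out_ch, kernel_size=3, padding=1),
            )
        self.reflectance_head = head(3)
        self.diffuse_head = head(1)
        self.specular_head = head(1)

    def forward(self, images: torch.Tensor):
        # images: (B, V, 3, H, W) -- a variable number of views V.
        B, V, _, H, W = images.shape
        feats = self.encoder(images.flatten(0, 1))            # (B*V, D, h, w)
        D, h, w = feats.shape[1:]
        tokens = feats.flatten(2).transpose(1, 2)             # (B*V, h*w, D)
        tokens = tokens.reshape(B, V * h * w, D)              # all views in one token sequence
        tokens = self.transformer(tokens)                     # cross-view self-attention
        feats = tokens.reshape(B * V, h, w, D).permute(0, 3, 1, 2)

        R = torch.sigmoid(self.reflectance_head(feats))       # diffuse reflectance
        S_d = torch.relu(self.diffuse_head(feats))            # diffuse shading
        S_s = torch.relu(self.specular_head(feats))           # specular shading
        recon = R * S_d + S_s                                  # image formation model
        unflatten = lambda x: x.reshape(B, V, *x.shape[1:])
        return unflatten(R), unflatten(S_d), unflatten(S_s), unflatten(recon)
```

Concatenating the tokens of all views into a single sequence is what lets self‑attention enforce cross‑view consistency; processing each view independently would reduce the model to a single‑view decomposer.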
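The training objective in step 4 can be written just as compactly. The sketch below follows the described terms (L1 reconstruction, cross‑view reflectance consistency, shading smoothness, specular sparsity); the loss weights and the precise form of each regularizer are assumptions rather than the paper's choices.

```python
import torch.nn.functional as F

def idt_losses(images, R, S_d, S_s, recon,
               w_consist=0.1, w_smooth=0.05, w_sparse=0.01):
    """Illustrative training loss; weights and regularizer forms are guesses.

    All tensors are assumed to have shape (B, V, C, H, W).
    """
    # Reconstruction: L1 between the recombined and original images.
    loss_recon = F.l1_loss(recon, images)

    # Reflectance consistency: pull each view's albedo toward the per-pixel
    # mean over views (a simple stand-in; the paper may instead match
    # corresponding surface points using the known camera poses).
    loss_consist = F.l1_loss(R, R.mean(dim=1, keepdim=True).expand_as(R))

    # Shading smoothness: penalize spatial gradients of the diffuse shading.
    loss_smooth = (S_d[..., :, 1:] - S_d[..., :, :-1]).abs().mean() \
                + (S_d[..., 1:, :] - S_d[..., :-1, :]).abs().mean()

    # Specular sparsity: keep highlights localized.
    loss_sparse = S_s.abs().mean()

    return loss_recon + w_consist * loss_consist \
        + w_smooth * loss_smooth + w_sparse * loss_sparse
```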

Results & Findings

| Dataset | Metric | IDT (vs. baseline) |
| --- | --- | --- |
| Synthetic Multi‑View (SYN‑MV) | Diffuse albedo error, MAE (lower is better) | 0.042 (vs. 0.067) |
| Synthetic Multi‑View (SYN‑MV) | Shading consistency (lower is better) | 0.018 (vs. 0.031) |
| Synthetic Multi‑View (SYN‑MV) | Specular isolation (lower is better) | 0.021 (vs. 0.038) |
| Real‑World Capture (RWC) | Visual consistency score | 0.73 (vs. 0.58) |

  • Cleaner albedo: IDT removes view‑dependent color bleed, yielding uniform material colors across angles.
  • Coherent shading: Diffuse shading maps are smooth across viewpoints, reflecting consistent illumination.
  • Specular separation: Highlights are isolated into the specular branch, making downstream relighting or material editing easier.
  • Speed: A full multi‑view batch (8 × 512×512 images) processes in ~120 ms on an RTX 3090, far faster than diffusion‑based iterative methods that need seconds per view.

Practical Implications

  • Real‑time relighting & AR: Developers can extract view‑consistent albedo and shading on‑the‑fly, enabling dynamic lighting changes in mixed‑reality applications without re‑rendering the entire scene.
  • Material digitization: Clean diffuse maps simplify texture creation for game assets or product visualizations, while specular maps can be directly used in PBR pipelines.
  • Robotics & perception: Consistent intrinsic decomposition aids illumination‑invariant object detection and surface property estimation for autonomous agents operating under varying lighting.
  • Content creation tools: Photo‑editing software can offer “material‑aware” adjustments (e.g., recolor, highlight removal) that respect the underlying physics, thanks to the separated components.
  • Scalable pipelines: Because IDT is feed‑forward, it can be integrated into batch processing or streaming systems without the memory‑heavy sampling loops of diffusion models.

Limitations & Future Work

  • Dependence on accurate pose: The current implementation assumes known camera extrinsics; errors in pose estimation can degrade consistency.
  • Synthetic‑to‑real gap: While fine‑tuning helps, the model still struggles with extreme outdoor lighting (e.g., strong directional sunlight) not seen during training.
  • Restricted view counts during training: Although inference accepts a variable number of views, the network is optimized for a specific range (4–8 views), and performance may drop with very sparse or very dense view sets.
  • Future directions: The authors suggest integrating learned pose refinement, expanding the training corpus with more diverse real‑world captures, and exploring hierarchical transformers to handle thousands of views for large‑scale scene reconstruction.

Authors

  • Kang Du
  • Yirui Guan
  • Zeyu Wang

Paper Information

  • arXiv ID: 2512.23667v1
  • Categories: cs.CV
  • Published: December 29, 2025