[Paper] StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

Published: December 11, 2025 at 01:59 PM EST
Source: arXiv - 2512.10959v1

Overview

StereoSpace proposes a depth‑free way to turn a single image into a full‑blown stereo pair. Instead of estimating depth maps or warping pixels, the system uses a diffusion model that is conditioned only on the desired viewpoint. By learning to “imagine” the opposite eye’s view directly in a canonical rectified space, it can generate sharp parallax and handle challenging cases such as transparent objects or reflective surfaces.

Key Contributions

  • Viewpoint‑conditioned diffusion: Introduces a diffusion generator that takes a target camera pose as the sole geometric cue, eliminating the need for explicit depth estimation or warping pipelines.
  • Canonical rectified space: Defines a shared, rectified coordinate system where left‑right correspondence is learned implicitly, simplifying the learning problem and improving consistency.
  • End‑to‑end evaluation protocol: Provides a fair test setup that forbids any ground‑truth or proxy geometry at inference time, focusing on perceptual comfort (iSQoE) and geometric consistency (MEt3R).
  • State‑of‑the‑art performance: Beats existing warp‑&‑inpaint, latent‑warping, and warped‑conditioning baselines on both synthetic and real‑world datasets, especially on layered or non‑Lambertian scenes.
  • Scalable architecture: Uses a single diffusion model that can be trained once and deployed for any monocular‑to‑stereo task without per‑scene tuning.

Methodology

Canonical Rectified Space

All images are first mapped to a rectified stereo frame where the epipolar lines are horizontal. This removes the need for the model to learn complex epipolar geometry.
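The paper's preprocessing code is not reproduced here, but one common way to build such a rectified frame for a calibrated training pair is OpenCV's stereo rectification. The sketch below is only an illustration under that assumption; the intrinsics, distortion, pose, and 6 cm baseline are placeholder values, not the authors' setup.

```python
# Minimal sketch, not from the paper: rectify a calibrated image pair so that
# epipolar lines become horizontal. All calibration values are placeholders.
import cv2
import numpy as np

K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])          # shared pinhole intrinsics (assumed)
dist = np.zeros(5)                       # assume no lens distortion
R = np.eye(3)                            # relative rotation between the two views
T = np.array([[-0.06], [0.0], [0.0]])    # 6 cm horizontal baseline (assumed)
size = (640, 480)                        # image width, height

# Rectifying rotations (R1, R2) and projections (P1, P2): after remapping,
# corresponding points differ only in their x coordinate.
R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K, dist, K, dist, size, R, T)

left = np.zeros((480, 640, 3), np.uint8)   # stand-ins for real training images
right = np.zeros((480, 640, 3), np.uint8)

maps_l = cv2.initUndistortRectifyMap(K, dist, R1, P1, size, cv2.CV_32FC1)
maps_r = cv2.initUndistortRectifyMap(K, dist, R2, P2, size, cv2.CV_32FC1)
left_rect = cv2.remap(left, *maps_l, cv2.INTER_LINEAR)
right_rect = cv2.remap(right, *maps_r, cv2.INTER_LINEAR)
```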

Diffusion Generator with Viewpoint Conditioning

A standard denoising diffusion probabilistic model (DDPM) is augmented with a pose embedding that encodes the desired virtual camera offset (e.g., “shift 6 cm to the right”). During training, the model sees pairs of left/right images and learns to denoise a noised copy of the right view, conditioned on the clean left view and the pose token.
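As a rough PyTorch illustration of this conditioning (not the authors' architecture), the toy noise predictor below receives the noisy right view, the clean left view, the timestep, and a scalar baseline offset; the layer sizes and the cosine noise schedule are invented for brevity.

```python
# Minimal sketch, not the paper's model: the only geometric cue is a scalar
# viewpoint offset, injected alongside the diffusion timestep.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyViewpointDenoiser(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        # one shared embedding for (timestep, baseline offset in metres)
        self.embed = nn.Sequential(nn.Linear(2, ch), nn.SiLU(), nn.Linear(ch, ch))
        self.inp = nn.Conv2d(6, ch, 3, padding=1)   # noisy target view + source view
        self.mid = nn.Conv2d(ch, ch, 3, padding=1)
        self.out = nn.Conv2d(ch, 3, 3, padding=1)   # predicts the added noise

    def forward(self, x_noisy, src_view, t, baseline):
        cond = self.embed(torch.stack([t.float(), baseline.float()], dim=-1))
        h = F.silu(self.inp(torch.cat([x_noisy, src_view], dim=1)) + cond[:, :, None, None])
        return self.out(F.silu(self.mid(h)))

# One DDPM-style training step: recover the right view from noise, conditioned
# on the left view and a requested 6 cm rightward shift.
model = ToyViewpointDenoiser()
left, right = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
t = torch.randint(0, 1000, (2,))
a_bar = torch.cos(t / 1000 * math.pi / 2) ** 2                  # toy cosine schedule
noise = torch.randn_like(right)
x_t = a_bar.sqrt()[:, None, None, None] * right + (1 - a_bar).sqrt()[:, None, None, None] * noise
pred = model(x_t, left, t, baseline=torch.full((2,), 0.06))
loss = F.mse_loss(pred, noise)
loss.backward()
```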

End‑to‑End Synthesis

No explicit depth map, warping, or inpainting step is used. The diffusion process directly fills disoccluded regions while preserving texture continuity. The loss combines a reconstruction term (pixel‑wise L2) with a perceptual term (VGG‑based) to encourage realistic textures.
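The paper only states that the loss combines pixel-wise L2 with a VGG-based perceptual term; the sketch below assumes VGG16 relu3_3 features and an arbitrary 0.1 weighting, both of which are guesses rather than the authors' values.

```python
# Minimal sketch of an L2 + VGG perceptual loss (feature layer and weight assumed).
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

vgg_feat = vgg16(weights=VGG16_Weights.DEFAULT).features[:16].eval()  # up to relu3_3
for p in vgg_feat.parameters():
    p.requires_grad_(False)

def stereo_synthesis_loss(pred_right, gt_right, w_perc=0.1):
    """Pixel-wise L2 reconstruction plus a VGG feature (perceptual) term."""
    rec = F.mse_loss(pred_right, gt_right)
    perc = F.mse_loss(vgg_feat(pred_right), vgg_feat(gt_right))  # ImageNet normalization omitted for brevity
    return rec + w_perc * perc

# Usage: loss = stereo_synthesis_loss(generated_right_view, ground_truth_right_view)
```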

Evaluation Protocol

At test time the model receives only the monocular input and the target viewpoint; no depth or proxy geometry is supplied (a minimal harness sketch follows the metric list).

  • Metrics:
    • iSQoE (image‑based Stereo Quality of Experience) – measures perceived comfort and visual artifacts.
    • MEt3R (Mean Epipolar Transfer error) – quantifies geometric alignment of corresponding points across the generated pair.
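A minimal harness for this protocol might look as follows. It assumes the toy `model` and sampler interfaces from the Methodology sketches; `score_isqoe` and `score_met3r` are hypothetical stand-ins that only mark where the official metric implementations would plug in.

```python
# Protocol sketch, not the authors' evaluation code.
import torch

def score_isqoe(left, right):   # hypothetical placeholder (higher = better)
    return float("nan")

def score_met3r(left, right):   # hypothetical placeholder (lower = better)
    return float("nan")

@torch.no_grad()
def evaluate(model, sampler, test_batches, baseline_m=0.06):
    results = []
    for left in test_batches:                              # monocular input only
        shift = torch.full((left.shape[0],), baseline_m)   # target viewpoint: the sole geometric cue
        right = sampler(model, left, shift)                # no depth map, warp, or proxy geometry
        results.append((score_isqoe(left, right), score_met3r(left, right)))
    return results
```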

Results & Findings

Method Category          iSQoE (higher = better)    MEt3R (lower = better)
Warp & Inpaint           0.71                       4.9 px
Latent‑Warping           0.78                       3.8 px
Warped‑Conditioning      0.82                       3.2 px
StereoSpace (proposed)   0.89                       2.1 px

  • Sharp Parallax: Generated stereo pairs exhibit crisp disparity even at large baseline shifts.
  • Robustness to Complex Materials: Transparent layers, specular highlights, and semi‑transparent foliage are handled without the ghosting typical of depth‑based warps.
  • Generalization: The same model trained on a mixed synthetic/real dataset works on unseen indoor and outdoor scenes without fine‑tuning.

Practical Implications

  • VR/AR Content Creation: Developers can turn a single photograph or rendered frame into a stereoscopic asset on‑the‑fly, reducing the need for dual‑camera rigs or expensive depth sensors.
  • 3D Media Pipelines: Post‑production tools can automatically generate left/right eye views for legacy 2D footage, enabling quick conversion to 3D cinema or 360° video formats.
  • Robotics & Autonomous Systems: Simulators that need realistic stereo inputs for perception testing can use StereoSpace to synthesize depth‑consistent views without maintaining a full 3D model of the environment.
  • Edge Deployment: Because the approach eliminates heavy depth estimation modules, a single diffusion model (≈ 1 GB) can be run on modern GPUs or even accelerated on mobile NPUs for on‑device stereo generation.

Limitations & Future Work

  • Computational Cost: Diffusion inference still requires multiple denoising steps (≈ 50–100), which can be a bottleneck for real‑time applications.
  • Baseline Range: Extremely wide baselines (> 10 cm) start to degrade quality as the model has never seen such large disparities during training.
  • Training Data Bias: The model inherits any biases present in the training set (e.g., over‑representation of indoor scenes).
  • Future Directions: The authors suggest exploring accelerated sampling (e.g., DDIM) alongside classifier‑free guidance, extending the conditioning to dynamic scenes (video diffusion), and integrating learned priors for better handling of extreme baselines; a generic DDIM‑style sampler is sketched below.
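To make the accelerated-sampling direction concrete, here is a generic DDIM-style sampler (eta = 0, deterministic) that cuts the step count; it is not the paper's sampler and assumes a noise predictor with the toy interface `model(x_t, src_view, t, baseline)` from the Methodology sketch.

```python
# Generic DDIM sketch: strided timesteps reduce ~50-100 denoising steps to num_steps.
import math
import torch

@torch.no_grad()
def ddim_sample(model, src_view, baseline, num_steps=20, T=1000):
    B = src_view.shape[0]
    ts = torch.linspace(T - 1, 0, num_steps).long()                  # strided timesteps
    alpha_bar = torch.cos(torch.arange(T) / T * math.pi / 2) ** 2    # toy cosine schedule
    x = torch.randn_like(src_view)                                   # start from pure noise
    for i, t in enumerate(ts):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[ts[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        eps = model(x, src_view, t.repeat(B), baseline)
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()               # predicted clean target view
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps           # deterministic DDIM update
    return x
```

With the toy denoiser above, `ddim_sample(model, left, torch.full((left.shape[0],), 0.06))` would produce a right view in 20 steps instead of the full schedule.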

Authors

  • Tjark Behrens
  • Anton Obukhov
  • Bingxin Ke
  • Fabio Tosi
  • Matteo Poggi
  • Konrad Schindler

Paper Information

  • arXiv ID: 2512.10959v1
  • Categories: cs.CV
  • Published: December 11, 2025