[Paper] CAMEO: Correspondence-Attention Alignment for Multi-View Diffusion Models

Published: December 2, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.03045v1

Overview

Multi‑view diffusion models have become the go‑to tool for generating novel views of a scene from a single reference image, but the inner workings that keep the generated views geometrically consistent have remained a mystery. The new CAMEO framework uncovers how attention maps implicitly learn cross‑view correspondences and shows how a tiny amount of supervision can dramatically speed up training and boost synthesis quality.

Key Contributions

  • Empirical discovery: Attention maps in existing multi‑view diffusion models already encode geometric correspondences between reference and target views, but the signal degrades under large viewpoint changes (a probing sketch follows this list).
  • CAMEO training scheme: Introduces a lightweight supervision signal that directly aligns attention maps with ground‑truth geometric correspondences (e.g., depth or flow maps).
  • Single‑layer supervision: Demonstrates that supervising just one attention layer is enough to steer the whole network toward accurate cross‑view alignment.
  • Training efficiency: Cuts required training iterations by ~50 % while delivering higher‑quality novel view synthesis at the same iteration budget.
  • Model‑agnostic design: CAMEO can be plugged into any existing multi‑view diffusion architecture without architectural changes.
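
To make the empirical discovery concrete, below is a minimal probing sketch (PyTorch; tensor shapes and function names are illustrative, not from the paper) that converts a cross‑view attention map into hard pixel correspondences and measures how far they drift from ground truth:

```python
import torch

def attention_to_correspondence(attn, h, w):
    """Turn a cross-view attention map into hard correspondences.

    attn: (Q, K) attention weights from reference-view query tokens to
          target-view key tokens, with Q == K == h * w (one token per pixel).
    Returns (Q, 2) predicted (x, y) target coordinates per query token.
    """
    assert attn.shape[-1] == h * w
    idx = attn.argmax(dim=-1)                     # most-attended key per query
    ys, xs = idx // w, idx % w                    # unflatten to 2-D coordinates
    return torch.stack([xs, ys], dim=-1).float()

def correspondence_error(attn, gt_xy, h, w):
    """Mean pixel distance between attention-implied and ground-truth
    correspondences; rising error signals degraded alignment."""
    pred_xy = attention_to_correspondence(attn, h, w)
    return (pred_xy - gt_xy).norm(dim=-1).mean()
```

Applied across heads and viewpoint gaps, a probe of this kind is what reveals that alignment holds for small rotations and degrades for large ones.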

Methodology

  1. Diagnosing attention correspondence

    • The authors first visualized attention maps of a vanilla multi‑view diffusion model during training.
    • By overlaying known 3D correspondences (derived from depth or optical flow), they confirmed that many heads attend to the correct spatial locations across views, but the alignment becomes noisy for extreme camera rotations.
  2. Supervising attention with geometry

    • They construct a correspondence loss that penalizes the distance between the model’s attention distribution and a “ground‑truth” correspondence map derived from precomputed depth or flow.
    • This loss is applied to a single attention layer (typically a middle‑layer self‑attention block) while the rest of the diffusion model continues to be trained with the standard denoising objective.
  3. Training loop

    • For each training step, the model receives a reference image and a target viewpoint.
    • The diffusion loss (noise prediction) and the correspondence loss are summed, with a small weighting factor on the correspondence term (see the training‑step sketch after this list).
    • Because the supervision is sparse (one layer, one loss term), the extra compute overhead is negligible.
  4. Integration

    • CAMEO is implemented as a drop‑in module: replace the chosen attention block with a “CAMEO‑enabled” version that outputs both the usual attention weights and a loss term.
    • No changes to the diffusion scheduler, architecture, or inference pipeline are required.
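
Putting steps 2 and 3 together, the following sketch shows one plausible form of the combined objective (PyTorch). The cross‑entropy formulation, the `return_attention` hook, the batch fields, and the 0.1 weight are all assumptions for illustration; the paper specifies only a distance penalty between the attention distribution and the ground‑truth correspondence map, applied at a single layer:

```python
import torch
import torch.nn.functional as F

def correspondence_loss(attn, gt_corr):
    """Cross-entropy between the supervised layer's attention distribution
    and a target correspondence distribution.

    attn:    (B, Q, K) attention weights (each row sums to 1).
    gt_corr: (B, Q, K) target built from depth/flow, e.g. one-hot at the
             matching key token, or a soft Gaussian around it.
    """
    return -(gt_corr * attn.clamp_min(1e-8).log()).sum(dim=-1).mean()

LAMBDA_CORR = 0.1  # hypothetical small weighting factor

def training_step(model, batch):
    # One denoising step; `return_attention=True` is an assumed hook that
    # also exposes the supervised layer's attention weights.
    noise_pred, attn = model(batch["noisy_latents"], batch["ref_image"],
                             batch["target_camera"], return_attention=True)
    diffusion_loss = F.mse_loss(noise_pred, batch["noise"])  # standard objective
    corr_loss = correspondence_loss(attn, batch["gt_corr"])  # geometry alignment
    return diffusion_loss + LAMBDA_CORR * corr_loss
```

Because only one layer’s attention weights enter the extra term, the added compute per step is tiny, which matches the paper’s claim of negligible overhead.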

Results & Findings

| Metric | Baseline (no CAMEO) | CAMEO (single‑layer) |
| --- | --- | --- |
| LPIPS (perceptual similarity; lower is better) | 0.215 | 0.162 |
| PSNR (dB; higher is better) | 24.8 | 27.3 |
| Training iterations to converge* | 200k | ≈100k |

*Convergence defined as reaching a plateau in validation LPIPS.

  • Quality boost: Across several public multi‑view datasets (e.g., RealEstate10K, LLFF), CAMEO consistently improves texture fidelity and preserves fine geometric details.
  • Faster convergence: The correspondence loss acts as a strong regularizer, guiding the model to learn the correct geometry early and halving the number of training iterations required to converge.
  • Robustness to large view changes: Even when the target view is 90° away from the reference, CAMEO‑trained models maintain coherent structures, whereas the baseline often produces warped or duplicated objects.

Practical Implications

  • Faster prototyping: Teams building AR/VR content generators can train high‑quality multi‑view diffusion models in weeks instead of months, reducing cloud‑compute costs.
  • Plug‑and‑play upgrades: Existing pipelines (e.g., DreamFusion‑style 3‑D generation, view‑consistent image‑to‑video tools) can be upgraded by adding CAMEO supervision to a single attention block—no need to redesign the whole network.
  • Better downstream tasks: More accurate geometry in generated views benefits downstream applications such as 3‑D reconstruction, scene editing, and neural rendering, where consistency across viewpoints is critical.
  • Developer‑friendly tooling: Because the loss only requires a pre‑computed correspondence map (depth/flow), developers can reuse off‑the‑shelf depth estimators or even synthetic depth from CAD models, making integration straightforward.
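
As a concrete example of the last point, a correspondence map can be derived from an off‑the‑shelf depth estimate plus known camera parameters by straightforward reprojection. The sketch below (PyTorch) assumes pinhole intrinsics `K` shared by both views and a relative pose `(R, t)` from reference to target; it is an illustration, not the paper’s pipeline:

```python
import torch

def depth_to_correspondence(depth, K, R, t):
    """Reproject reference-view pixels into the target view.

    depth: (H, W) metric depth of the reference view.
    K:     (3, 3) camera intrinsics shared by both views.
    R, t:  relative rotation (3, 3) and translation (3,), ref -> target.
    Returns (H, W, 2) target-view (x, y) coordinates for each pixel.
    """
    H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()  # (H, W, 3)
    rays = pix @ torch.linalg.inv(K).T       # back-project pixels to camera rays
    pts = rays * depth[..., None]            # 3-D points in the reference frame
    pts = pts @ R.T + t                      # move points into the target frame
    proj = pts @ K.T                         # project with the intrinsics
    return proj[..., :2] / proj[..., 2:].clamp_min(1e-8)  # perspective divide
```

In practice, pixels that reproject outside the target image or are occluded would be masked out before the map is used as a supervision target.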

Limitations & Future Work

  • Dependence on correspondence quality: CAMEO’s supervision is only as good as the ground‑truth flow or depth maps; noisy estimates can propagate errors.
  • Single‑layer focus: While supervising one layer works well, the authors note that extremely complex scenes (e.g., heavy occlusions) might benefit from multi‑layer or hierarchical supervision.
  • Scalability to very high resolutions: The current experiments cap at 512 × 512; extending to 4K‑level textures may require additional memory‑efficient attention mechanisms.
  • Future directions: The paper suggests exploring learned correspondence generators (instead of external depth estimators), adaptive weighting of the correspondence loss during training, and applying CAMEO to other generative paradigms such as video diffusion models.

Authors

  • Minkyung Kwon
  • Jinhyeok Choi
  • Jiho Park
  • Seonghu Jeon
  • Jinhyuk Jang
  • Junyoung Seo
  • Minseop Kwak
  • Jin‑Hwa Kim
  • Seungryong Kim

Paper Information

  • arXiv ID: 2512.03045v1
  • Categories: cs.CV
  • Published: December 2, 2025