[Paper] GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes

Published: June 3, 2026 at 01:49 PM EDT

Source: arXiv - 2606.05142v1

Overview

The paper GeM‑NR: Geometry‑Aware Multi‑View Editing for Nonrigid Scene Changes addresses a gap in multi‑view image editing: making non‑rigid (shape‑changing) edits that stay consistent across many camera views without training a new model for each task. It combines off‑the‑shelf depth estimation, point‑cloud alignment, and a conditioning‑based refinement step into a training‑free pipeline that works with existing 2‑D generative editors (e.g., FLUX, Qwen, BrushNet) and produces coherent multi‑view results even when the edit substantially alters the scene geometry.

Key Contributions

Training‑free multi‑view editing framework that can be plugged into any 2‑D generative editor.
Geometry‑aware alignment: a novel depth‑map‑driven strategy that maximizes overlap between the 3‑D point clouds of the edited (anchor) and original (query) scenes, enabling large non‑rigid deformations.
Three‑stage pipeline (depth estimation → 3‑D projection → conditional refinement) that scales from two‑view to many‑view scenarios, with runtime growing near‑linearly in the number of views.
Extensive quantitative and qualitative evaluation showing state‑of‑the‑art consistency in both appearance and geometry across edited views.
Open‑source implementation (released with the paper) that allows developers to experiment with the method immediately.

Methodology

Anchor edit acquisition – The user first edits a single “anchor” image with any preferred 2‑D editor (e.g., a text‑guided diffusion model). This edit defines the desired visual change (new shape, color, added objects, etc.).
Depth map estimation & point‑cloud alignment –
- A depth estimator (e.g., MiDaS) predicts per‑pixel depth for both the original query image and the edited anchor.
- The resulting depth maps are turned into 3‑D point clouds.
- The authors introduce a simple yet effective alignment step that maximizes the overlap between the two clouds, effectively “warping” the original geometry toward the edited geometry while preserving the underlying camera pose.
Projection onto the query viewpoint – The aligned 3‑D points are re‑projected into the coordinate frame of the target (query) view, producing a rough, geometry‑consistent draft of the edited image.
Conditional refinement – A conditioning network (implemented as a lightweight diffusion‑based in‑painting model) takes the draft image and the original query image as inputs and refines the result, correcting artifacts and ensuring photometric consistency (lighting, texture) with the surrounding scene.
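
The depth‑to‑point‑cloud and re‑projection steps above can be sketched with a standard pinhole camera model. This is a minimal illustration, not the authors' implementation: the function names `unproject` and `project`, the intrinsics matrix `K`, and the relative pose `R`, `t` are all assumptions introduced here, and the paper's overlap‑maximizing alignment is not reproduced.

```python
import numpy as np

def unproject(depth, K):
    """Lift an (H, W) depth map to an (H*W, 3) point cloud in camera coordinates.

    Assumes a pinhole camera with 3x3 intrinsics K (hypothetical input).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row indices
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pixels @ np.linalg.inv(K).T   # K^-1 [u, v, 1]^T for every pixel
    return rays * depth.reshape(-1, 1)   # scale each ray by its depth value

def project(points, K, R, t):
    """Project (N, 3) source-camera points into a target view.

    R (3x3) and t (3,) map source-camera coordinates to target-camera
    coordinates. Returns pixel coordinates (N, 2) and target-view depths (N,).
    Points with non-positive depth are not filtered here; the caller should
    mask them using the returned depths.
    """
    cam = points @ R.T + t
    z = cam[:, 2]
    uv = (cam @ K.T)[:, :2] / z[:, None]
    return uv, z
```

With an identity pose (`R = np.eye(3)`, `t = np.zeros(3)`), projecting the unprojected points returns the original pixel grid, which is a quick sanity check before plugging in a real relative pose.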

Because the refinement step is conditioned only on the draft and the original query view, each viewpoint is processed independently. The method therefore extends to any number of viewpoints: the same anchor edit can be propagated to dozens of query images with one refinement pass per view.
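
The per‑view propagation can be pictured as a simple loop. The sketch below is only a structural illustration: `propagate_edit`, `draft_fn`, and `refine_fn` are hypothetical names standing in for the geometry stages and the refinement model, none of which are taken from the released code.

```python
from typing import Callable, List, Sequence, TypeVar

Image = TypeVar("Image")

def propagate_edit(
    anchor_edit: Image,
    query_views: Sequence[Image],
    draft_fn: Callable[[Image, Image], Image],
    refine_fn: Callable[[Image, Image], Image],
) -> List[Image]:
    """Apply one anchor edit to every query view, one view at a time.

    draft_fn(anchor_edit, query) stands in for depth estimation, alignment,
    and re-projection; refine_fn(draft, query) stands in for the conditional
    refinement. Both are placeholders supplied by the caller.
    """
    results = []
    for query in query_views:
        draft = draft_fn(anchor_edit, query)      # rough, geometry-consistent draft
        results.append(refine_fn(draft, query))   # refinement conditioned on the original view
    return results
```

Because no view depends on another view's output, the total cost is the per‑view cost times the number of views, and the loop could be batched or parallelized.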

Results & Findings

Metric	Baseline (rigid‑only)	GeM‑NR (non‑rigid)
Multi‑view PSNR (appearance)	28.7 dB	31.4 dB
Chamfer Distance (geometry)	0.018	0.009
User study (consistency rating)	2.8 / 5	4.3 / 5
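
For readers unfamiliar with the appearance metric, PSNR is derived from the mean squared error between two images. The helper below is a generic sketch; how the paper aggregates PSNR across views is not stated in this summary, and the function name `psnr` and the `max_value` parameter are choices made here for illustration.

```python
import numpy as np

def psnr(reference, test, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape.

    max_value is the largest possible pixel value (1.0 for float images,
    255 for 8-bit images).
    """
    reference = np.asarray(reference, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Since PSNR is logarithmic in the error, the 2.7 dB gap in the table corresponds to a mean squared error ratio of 10^(-0.27) ≈ 0.54, i.e. roughly 46% lower error under the same peak value.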

Geometric consistency: The Chamfer distance between point clouds reconstructed from different views drops from 0.018 to 0.009 (about 50%) relative to the rigid‑only baseline, indicating that the pipeline preserves the edited shape across viewpoints.
Photometric quality: Multi‑view PSNR rises from 28.7 dB to 31.4 dB, and SSIM is also reported to improve; this suggests the conditional refinement step restores realistic lighting and texture, even when the edit introduces large occlusions or new objects.
Scalability: Experiments with 2, 8, and 32 views show near‑linear runtime growth; the method processes a 512 × 512 image in ~0.8 s per view on a single RTX 3090, which works out to roughly 26 s for 32 views.
Versatility: Demonstrations include bending a metal rod, reshaping a chair back, adding a new sculpture, and even morphing a human face, tasks that previously required task‑specific models.
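
The geometry metric can be illustrated with a brute‑force Chamfer distance. Conventions vary (squared vs. unsquared distances, sum vs. mean), and this summary does not say which one the paper uses; the sketch below assumes the symmetric mean of unsquared nearest‑neighbour distances, and `chamfer_distance` is a name chosen here.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Brute force: builds the full N x M distance matrix, so it is only
    suitable for small clouds. Larger clouds would need a spatial index.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances
    # Mean nearest-neighbour distance from a to b, plus the same from b to a.
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())
```

Two identical clouds give a distance of 0, and the value grows as the shapes diverge, so a lower number across views indicates more consistent geometry.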

Practical Implications

Rapid prototyping for AR/VR assets – Designers can edit a single reference image (e.g., “make this chair taller”) and quickly obtain a coherent set of multi‑view images that can feed 3‑D reconstruction and rendering pipelines.
Game content pipelines – Artists can generate variant geometry (damage, wear, customizations) without re‑authoring meshes; the output can be fed into neural radiance fields (NeRF) or traditional mesh reconstruction tools.
E‑commerce visual customization – Retail platforms could let shoppers modify product shapes (e.g., “stretch the sleeve”) and preview the change from multiple angles, which may improve conversion rates.
Film VFX & post‑production – Non‑rigid scene edits (e.g., reshaping a prop) can be applied consistently across multiple camera shots without costly manual rotoscoping.
Open‑source community – Since GeM‑NR works with any off‑the‑shelf 2‑D editor, developers can experiment with emerging diffusion models without re‑training geometry‑aware networks.

Limitations & Future Work

Depth estimation dependency – The pipeline’s accuracy hinges on the quality of the initial depth maps; failure cases include glossy or texture‑less surfaces where depth predictors struggle.
Large viewpoint gaps – When the query view is dramatically different from the anchor (e.g., opposite sides of an object), the alignment may produce holes or stretching artifacts.
No explicit mesh output – While the method yields consistent images, converting the result to a clean, editable mesh still requires an additional reconstruction step.
Future directions suggested by the authors include integrating learned depth priors for more robust geometry, extending the conditioning network to handle video streams (temporal consistency), and exploring end‑to‑end differentiable alignment to further reduce artifacts in extreme viewpoint changes.

Authors

Josef Bengtson
Yaroslava Lochman
Fredrik Kahl

Paper Information

arXiv ID: 2606.05142v1
Categories: cs.CV, cs.AI
Published: June 3, 2026
