[Paper] Edit3r: Instant 3D Scene Editing from Sparse Unposed Images
Source: arXiv - 2512.25071v1
Overview
Edit3r is a new feed‑forward system that reconstructs a 3D scene and applies user‑driven edits in a single forward pass, even when the input images are sparse, unposed, and inconsistently altered by 2D editing tools. By sidestepping the costly per‑scene optimization that dominates prior work, Edit3r makes real‑time, photorealistic 3D editing feasible for developers building AR/VR, game, and visual‑effects pipelines.
Key Contributions
- Instant 3D reconstruction & editing from a handful of unposed, view‑inconsistent images – no iterative optimization or pose estimation required.
- Cross‑view consistent supervision via a SAM2‑based recoloring pipeline that automatically generates edited multi‑view training pairs.
- Asymmetric input strategy that fuses a recolored reference view with raw auxiliary views, teaching the network to align disparate observations.
- DL3DV‑Edit‑Bench, a new benchmark (20 scenes, 4 edit types, 100 total edits) for systematic evaluation of 3‑D editing quality and speed.
- State‑of‑the‑art performance: higher semantic alignment and 3‑D consistency than recent baselines while running orders of magnitude faster.
Methodology
Data Preparation
- Start with unedited multi‑view images from the DL3DV dataset.
- Apply a SAM2‑driven recoloring step that propagates a 2‑D edit (e.g., “make the wall red”) consistently across all views, creating a pseudo‑ground‑truth edited set.
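A minimal sketch of this recoloring idea, assuming a generic mask‑propagation helper in place of the actual SAM2 pipeline; the function names and the HSV hue‑shift rule are illustrative, not the authors' exact implementation:

```python
import numpy as np
import cv2  # OpenCV, used here only for color-space conversion


def propagate_masks(views, seed_mask):
    """Stand-in for SAM2-style mask propagation.

    Given a binary mask drawn on the reference view, SAM2 would track the
    region across the remaining views; here we simply repeat the seed mask
    as a placeholder.
    """
    return [seed_mask for _ in views]


def recolor_region(image_bgr, mask, target_hue):
    """Replace the hue of the masked region while keeping saturation/value,
    which roughly preserves shading and lighting."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    hsv[..., 0] = np.where(mask > 0, target_hue, hsv[..., 0])
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)


def build_edited_views(views, seed_mask, target_hue=0):
    """Create pseudo-ground-truth edited views for one scene
    (hue 0 is red in OpenCV's HSV convention, e.g. "make the wall red")."""
    masks = propagate_masks(views, seed_mask)
    return [recolor_region(v, m, target_hue) for v, m in zip(views, masks)]
```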
Network Architecture
- A single encoder‑decoder takes as input an asymmetric bundle: one recolored reference view + several raw views.
- The encoder learns to merge heterogeneous observations, while the decoder predicts a NeRF‑style volumetric field that already incorporates the instructed edit.
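A condensed PyTorch‑style sketch of the asymmetric‑input idea; the patch embedding, role embeddings, transformer fusion, and the density+RGB head below are assumptions made for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn


class AsymmetricEncoderDecoder(nn.Module):
    """Toy model: one recolored reference view plus N raw auxiliary views in,
    a per-token density+RGB prediction (placeholder for the volumetric field) out."""

    def __init__(self, dim=256, patch=16):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Tells the encoder which tokens come from the edited reference view (0)
        # and which come from raw auxiliary views (1).
        self.role_embed = nn.Embedding(2, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.head = nn.Linear(dim, 4)  # density + RGB per token

    def forward(self, ref_view, aux_views):
        # ref_view: (B, 3, H, W); aux_views: (B, N, 3, H, W)
        B, N = aux_views.shape[:2]
        views = torch.cat([ref_view.unsqueeze(1), aux_views], dim=1)       # (B, 1+N, 3, H, W)
        tokens = self.patchify(views.flatten(0, 1)).flatten(2).transpose(1, 2)
        T = tokens.shape[1]                                                # patches per view
        tokens = tokens.reshape(B, (1 + N) * T, -1)
        roles = torch.cat([
            torch.zeros(B, T, dtype=torch.long, device=tokens.device),     # reference tokens
            torch.ones(B, N * T, dtype=torch.long, device=tokens.device),  # auxiliary tokens
        ], dim=1)
        fused = self.encoder(tokens + self.role_embed(roles))
        return self.head(fused)                                            # (B, (1+N)*T, 4)


# Quick shape check with random inputs.
model = AsymmetricEncoderDecoder()
out = model(torch.randn(2, 3, 128, 128), torch.randn(2, 4, 3, 128, 128))
print(out.shape)  # torch.Size([2, 320, 4]) with 16x16 patches on 128x128 inputs
```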
Training Objective
- Photometric loss between rendered novel views and the SAM2‑recolored supervision ensures cross‑view consistency.
- Semantic alignment loss (using CLIP embeddings) encourages the edited geometry to match the textual instruction.
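Schematically, the combined objective might look like the sketch below, with generic embedding callables standing in for a frozen CLIP image/text encoder and the loss weight chosen arbitrarily:

```python
import torch
import torch.nn.functional as F


def edit3r_style_loss(rendered, recolored_gt, instruction,
                      image_embed, text_embed, lambda_sem=0.1):
    """Photometric consistency against the SAM2-recolored views plus a
    CLIP-style semantic alignment term against the edit instruction.

    rendered, recolored_gt: (B, V, 3, H, W) novel-view renders and
    pseudo-ground-truth edited views.
    image_embed / text_embed: stand-ins for frozen CLIP encoders returning
    (M, D) and (1, D) embeddings respectively.
    """
    # Cross-view photometric term (L2 here; an L1/SSIM mix is equally plausible).
    photometric = F.mse_loss(rendered, recolored_gt)

    # Semantic alignment: cosine similarity between every rendered view
    # and the text instruction in the shared embedding space.
    img_emb = F.normalize(image_embed(rendered.flatten(0, 1)), dim=-1)  # (B*V, D)
    txt_emb = F.normalize(text_embed(instruction), dim=-1)              # (1, D)
    semantic = 1.0 - (img_emb @ txt_emb.T).mean()

    return photometric + lambda_sem * semantic
```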
Inference
- Users supply any set of sparse photos (no pose info) and a textual edit (or a 2‑D edited image from tools like InstructPix2Pix).
- The model instantly outputs a renderable 3‑D representation that reflects the edit, ready for downstream rendering or interaction.
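In pipeline terms, inference could look like the following sketch; `Edit3rModel`, its `from_pretrained` loader, and the `predict`/`render` calls are hypothetical placeholder names, not a published API:

```python
from pathlib import Path
from PIL import Image

from edit3r import Edit3rModel  # hypothetical package and class name

# A handful of unposed photos of the scene, in no particular order.
photos = [Image.open(p) for p in sorted(Path("room_photos").glob("*.jpg"))]

model = Edit3rModel.from_pretrained("edit3r-base")         # hypothetical checkpoint id
scene = model.predict(images=photos,
                      instruction="make the sofa blue")    # or pass a 2D-edited reference image

# The returned scene is directly renderable: no per-scene optimization step.
scene.render(camera="orbit", frame_index=0).save("edited_view.png")
```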
Results & Findings
| Metric | Edit3r | Prior Optimization‑Based Methods |
|---|---|---|
| Semantic Alignment (CLIP‑Score) | 0.78 | 0.62 |
| 3‑D Consistency (Multi‑View PSNR) | 28.4 dB | 24.1 dB |
| Inference Time (per scene) | ≈0.3 s | ≈30 s – 5 min |
- Qualitative examples show Edit3r accurately changing colors, adding objects, or removing elements while preserving geometry and lighting across unseen viewpoints.
- The model generalizes to edits it never saw during training (e.g., stylized sketches from InstructPix2Pix), confirming robustness to diverse 2‑D editing pipelines.
- On the newly released DL3DV‑Edit‑Bench, Edit3r consistently outperforms baselines on all four edit categories (color change, texture swap, object addition, object removal).
Practical Implications
- Real‑time AR/VR content creation: developers can let end‑users snap a few photos of a room, type “make the sofa blue”, and instantly obtain a 3‑D scene ready for rendering or physics simulation.
- Game asset pipelines: artists can rapidly prototype level edits without manually retopologizing or re‑baking textures; the feed‑forward model handles the heavy lifting.
- Visual effects & post‑production: on‑set footage can be edited on the fly, enabling quick iteration on set extensions or matte‑painting adjustments.
- Integration with existing 2‑D editors: because Edit3r works with outputs from tools like InstructPix2Pix, studios can keep their familiar 2‑D workflows while gaining 3‑D capabilities with minimal engineering effort.
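As a concrete illustration of that handoff, the sketch below produces the 2‑D edit with the publicly available InstructPix2Pix pipeline from Hugging Face diffusers and then passes it, together with the raw views, to the same hypothetical `Edit3rModel` wrapper used in the inference sketch above:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

from edit3r import Edit3rModel  # hypothetical wrapper, as in the inference sketch

# 1. Edit the reference view with an off-the-shelf 2D instruction-based editor.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")
reference = Image.open("room_photos/view_00.jpg")
edited_reference = pipe("make the sofa blue", image=reference).images[0]

# 2. Hand the edited reference plus the remaining raw, unposed views to Edit3r.
raw_views = [Image.open(f"room_photos/view_{i:02d}.jpg") for i in range(1, 6)]
model = Edit3rModel.from_pretrained("edit3r-base")          # hypothetical checkpoint id
scene = model.predict(images=[edited_reference, *raw_views])
scene.render(camera="orbit", frame_index=0).save("edited_view.png")
```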
Limitations & Future Work
- Sparse‑view sensitivity: although the model tolerates unposed inputs, extremely sparse or heavily occluded captures can degrade geometry quality.
- Edit scope: the current training covers four edit types; more complex structural changes (e.g., geometry deformation) remain challenging.
- Resolution: rendered outputs are limited to the network’s native voxel resolution; higher‑fidelity rendering would need a downstream up‑sampling stage.
- Future directions suggested by the authors include extending the asymmetric input paradigm to handle video streams, incorporating explicit pose estimation to boost accuracy in edge cases, and scaling the model to support full‑scene geometry edits.
Authors
- Jiageng Liu
- Weijie Lyu
- Xueting Li
- Yejie Guo
- Ming-Hsuan Yang
Paper Information
- arXiv ID: 2512.25071v1
- Categories: cs.CV
- Published: December 31, 2025