[Paper] Edit3r: Instant 3D Scene Editing from Sparse Unposed Images
Source: arXiv - 2512.25071v1
Overview
Edit3r is a new feed‑forward system that reconstructs a 3D scene and applies user‑driven edits in a single forward pass, even when the input images are sparse, unposed, and inconsistently altered by 2D editing tools. By sidestepping the costly per‑scene optimization that dominates prior work, Edit3r makes real‑time, photorealistic 3D editing feasible for developers building AR/VR, game, and visual‑effects pipelines.
Key Contributions
- Instant 3D reconstruction & editing from a handful of unposed, view‑inconsistent images – no iterative optimization or pose estimation required.
- Cross‑view consistent supervision via a SAM2‑based recoloring pipeline that automatically generates edited multi‑view training pairs.
- Asymmetric input strategy that fuses a recolored reference view with raw auxiliary views, teaching the network to align disparate observations.
- DL3DV‑Edit‑Bench, a new benchmark (20 scenes, 4 edit types, 100 total edits) for systematic evaluation of 3‑D editing quality and speed.
- State‑of‑the‑art performance: higher semantic alignment and 3‑D consistency than recent baselines while running orders of magnitude faster.
Methodology
Data Preparation
- Start with unedited multi‑view images from the DL3DV dataset.
- Apply a SAM2‑driven recoloring step that propagates a 2‑D edit (e.g., “make the wall red”) consistently across all views, creating a pseudo‑ground‑truth edited set.
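A minimal sketch of this recoloring idea, assuming a generic mask‑propagation helper in place of the actual SAM2 pipeline; the function names and the HSV hue‑shift rule are illustrative, not the authors' exact implementation:

```python
import numpy as np
import cv2  # OpenCV, used here only for color-space conversion


def propagate_masks(views, seed_mask):
    """Stand-in for SAM2-style mask propagation.

    Given a binary mask drawn on the reference view, SAM2 would track the
    region across the remaining views; here we simply repeat the seed mask
    as a placeholder.
    """
    return [seed_mask for _ in views]


def recolor_region(image_bgr, mask, target_hue):
    """Replace the hue of the masked region while keeping saturation/value,
    which roughly preserves shading and lighting."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    hsv[..., 0] = np.where(mask > 0, target_hue, hsv[..., 0])
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)


def build_edited_views(views, seed_mask, target_hue=0):
    """Create pseudo-ground-truth edited views for one scene
    (hue 0 is red in OpenCV's HSV convention, e.g. "make the wall red")."""
    masks = propagate_masks(views, seed_mask)
    return [recolor_region(v, m, target_hue) for v, m in zip(views, masks)]
```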
Network Architecture
- A single encoder‑decoder takes as input an asymmetric bundle: one recolored reference view + several raw views.
- The encoder learns to merge heterogeneous observations, while the decoder predicts a NeRF‑style volumetric field that already incorporates the instructed edit.
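A condensed PyTorch‑style sketch of the asymmetric‑input idea; the patch embedding, role embeddings, transformer fusion, and the density+RGB head below are assumptions made for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn


class AsymmetricEncoderDecoder(nn.Module):
    """Toy model: one recolored reference view plus N raw auxiliary views in,
    a per-token density+RGB prediction (placeholder for the volumetric field) out."""

    def __init__(self, dim=256, patch=16):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Tells the encoder which tokens come from the edited reference view (0)
        # and which come from raw auxiliary views (1).
        self.role_embed = nn.Embedding(2, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.head = nn.Linear(dim, 4)  # density + RGB per token

    def forward(self, ref_view, aux_views):
        # ref_view: (B, 3, H, W); aux_views: (B, N, 3, H, W)
        B, N = aux_views.shape[:2]
        views = torch.cat([ref_view.unsqueeze(1), aux_views], dim=1)       # (B, 1+N, 3, H, W)
        tokens = self.patchify(views.flatten(0, 1)).flatten(2).transpose(1, 2)
        T = tokens.shape[1]                                                # patches per view
        tokens = tokens.reshape(B, (1 + N) * T, -1)
        roles = torch.cat([
            torch.zeros(B, T, dtype=torch.long, device=tokens.device),     # reference tokens
            torch.ones(B, N * T, dtype=torch.long, device=tokens.device),  # auxiliary tokens
        ], dim=1)
        fused = self.encoder(tokens + self.role_embed(roles))
        return self.head(fused)                                            # (B, (1+N)*T, 4)


# Quick shape check with random inputs.
model = AsymmetricEncoderDecoder()
out = model(torch.randn(2, 3, 128, 128), torch.randn(2, 4, 3, 128, 128))
print(out.shape)  # torch.Size([2, 320, 4]) with 16x16 patches on 128x128 inputs
```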
Training Objective
- Photometric loss between rendered novel views and the SAM2‑recolored supervision ensures cross‑view consistency.
- Semantic alignment loss (using CLIP embeddings) encourages the edited geometry to match the textual instruction.
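Schematically, the combined objective might look like the sketch below, with generic embedding callables standing in for a frozen CLIP image/text encoder and the loss weight chosen arbitrarily:

```python
import torch
import torch.nn.functional as F


def edit3r_style_loss(rendered, recolored_gt, instruction,
                      image_embed, text_embed, lambda_sem=0.1):
    """Photometric consistency against the SAM2-recolored views plus a
    CLIP-style semantic alignment term against the edit instruction.

    rendered, recolored_gt: (B, V, 3, H, W) novel-view renders and
    pseudo-ground-truth edited views.
    image_embed / text_embed: stand-ins for frozen CLIP encoders returning
    (M, D) and (1, D) embeddings respectively.
    """
    # Cross-view photometric term (L2 here; an L1/SSIM mix is equally plausible).
    photometric = F.mse_loss(rendered, recolored_gt)

    # Semantic alignment: cosine similarity between every rendered view
    # and the text instruction in the shared embedding space.
    img_emb = F.normalize(image_embed(rendered.flatten(0, 1)), dim=-1)  # (B*V, D)
    txt_emb = F.normalize(text_embed(instruction), dim=-1)              # (1, D)
    semantic = 1.0 - (img_emb @ txt_emb.T).mean()

    return photometric + lambda_sem * semantic
```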
Inference
- Users supply any set of sparse photos (no pose info) and a textual edit (or a 2‑D edited image from tools like InstructPix2Pix).
- The model instantly outputs a renderable 3‑D representation that reflects the edit, ready for downstream rendering or interaction.
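In pipeline terms, inference could look like the following sketch; `Edit3rModel`, its `from_pretrained` loader, and the `predict`/`render` calls are hypothetical placeholder names, not a published API:

```python
from pathlib import Path
from PIL import Image

from edit3r import Edit3rModel  # hypothetical package and class name

# A handful of unposed photos of the scene, in no particular order.
photos = [Image.open(p) for p in sorted(Path("room_photos").glob("*.jpg"))]

model = Edit3rModel.from_pretrained("edit3r-base")         # hypothetical checkpoint id
scene = model.predict(images=photos,
                      instruction="make the sofa blue")    # or pass a 2D-edited reference image

# The returned scene is directly renderable: no per-scene optimization step.
scene.render(camera="orbit", frame_index=0).save("edited_view.png")
```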
Results & Findings
| Metric | Edit3r | Prior Optimization‑Based Methods |
|---|---|---|
| Semantic Alignment (CLIP‑Score) | 0.78 | 0.62 |
| 3‑D Consistency (Multi‑View PSNR) | 28.4 dB | 24.1 dB |
| Inference Time (per scene) | ≈0.3 s | ≈30 s – 5 min |
- Qualitative examples show Edit3r accurately changing colors, adding objects, or removing elements while preserving geometry and lighting across unseen viewpoints.
- The model generalizes to edits it never saw during training (e.g., stylized sketches from InstructPix2Pix), confirming robustness to diverse 2‑D editing pipelines.
- On the newly released DL3DV‑Edit‑Bench, Edit3r consistently outperforms baselines on all four edit categories (color change, texture swap, object addition, object removal).
Practical Implications
- Real‑time AR/VR content creation: developers can let end‑users snap a few photos of a room, type “make the sofa blue”, and instantly obtain a 3‑D scene ready for rendering or physics simulation.
- Game asset pipelines: artists can rapidly prototype level edits without manually retopologizing or re‑baking textures; the feed‑forward model handles the heavy lifting.
- Visual effects & post‑production: on‑set footage can be edited on the fly, enabling quick iteration on set extensions or matte‑painting adjustments.
- Integration with existing 2‑D editors: because Edit3r works with outputs from tools like InstructPix2Pix, studios can keep their familiar 2‑D workflows while gaining 3‑D capabilities with minimal engineering effort.
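As a concrete illustration of that handoff, the sketch below produces the 2‑D edit with the publicly available InstructPix2Pix pipeline from Hugging Face diffusers and then passes it, together with the raw views, to the same hypothetical `Edit3rModel` wrapper used in the inference sketch above:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

from edit3r import Edit3rModel  # hypothetical wrapper, as in the inference sketch

# 1. Edit the reference view with an off-the-shelf 2D instruction-based editor.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")
reference = Image.open("room_photos/view_00.jpg")
edited_reference = pipe("make the sofa blue", image=reference).images[0]

# 2. Hand the edited reference plus the remaining raw, unposed views to Edit3r.
raw_views = [Image.open(f"room_photos/view_{i:02d}.jpg") for i in range(1, 6)]
model = Edit3rModel.from_pretrained("edit3r-base")          # hypothetical checkpoint id
scene = model.predict(images=[edited_reference, *raw_views])
scene.render(camera="orbit", frame_index=0).save("edited_view.png")
```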
Limitations & Future Work
- Sparse‑view sensitivity: although the model tolerates unposed inputs, extremely sparse or heavily occluded captures can degrade geometry quality.
- Edit scope: the current training covers four edit types; more complex structural changes (e.g., geometry deformation) remain challenging.
- Resolution: rendered outputs are limited to the network’s native voxel resolution; higher‑fidelity rendering would need a downstream up‑sampling stage.
- Future directions suggested by the authors include extending the asymmetric input paradigm to handle video streams, incorporating explicit pose estimation to boost accuracy in edge cases, and scaling the model to support full‑scene geometry edits.
Authors
- Jiageng Liu
- Weijie Lyu
- Xueting Li
- Yejie Guo
- Ming-Hsuan Yang
Paper Information
- arXiv ID: 2512.25071v1
- Categories: cs.CV
- Published: December 31, 2025