[Paper] Layout Anything: One Transformer for Universal Room Layout Estimation
Source: arXiv - 2512.02952v1
Overview
Layout Anything introduces a single‑transformer model that can predict the 3‑D room layout of indoor scenes directly from a single RGB image. By adapting the versatile OneFormer segmentation architecture, the authors achieve high‑quality layout estimates without the cumbersome post‑processing steps that have traditionally plagued this task, making the approach both fast (≈114 ms per image) and ready for real‑world AR/VR pipelines.
Key Contributions
- Unified Transformer Architecture – Re‑purposes OneFormer’s task‑conditioned query mechanism for geometric layout prediction, eliminating the need for separate segmentation and geometry modules (a minimal query sketch follows this list).
- Layout Degeneration Augmentation – A topology‑aware data‑augmentation scheme that synthetically “degenerates” room layouts while preserving Manhattan‑world constraints, dramatically expanding training diversity.
- Differentiable Geometric Losses – Introduces planar‑consistency and sharp‑boundary losses that are fully differentiable, allowing the network to learn geometry directly rather than relying on heuristic post‑processing.
- Real‑Time Inference – Optimized end‑to‑end pipeline runs at ~114 ms per image on a single GPU, a notable speed‑up over prior state‑of‑the‑art methods.
- State‑of‑the‑Art Benchmarks – Sets new best‑in‑class numbers on LSUN, Hedau, and Matterport3D‑Layout datasets (e.g., 5.43 % pixel error on LSUN).
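The task‑conditioned query idea referenced above can be illustrated with a minimal PyTorch sketch. Everything below, including the module names, dimensions, number of queries, and the wall/floor/ceiling class split, is an assumption for illustration rather than the authors’ implementation:

```python
# Minimal sketch of task-conditioned, query-based layout prediction
# (illustrative only; names, sizes, and the query/class split are assumptions).
import torch
import torch.nn as nn

class LayoutQueryDecoder(nn.Module):
    def __init__(self, d_model=256, num_queries=16, num_classes=3):  # wall / floor / ceiling
        super().__init__()
        # Learnable queries, conditioned on the "layout" task via a task embedding.
        self.queries = nn.Embedding(num_queries, d_model)
        self.task_embed = nn.Embedding(1, d_model)            # single task: room layout
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
        self.mask_embed = nn.Linear(d_model, d_model)

    def forward(self, pixel_feats):
        # pixel_feats: (B, H*W, d_model) features from the transformer encoder backbone.
        B = pixel_feats.shape[0]
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1) + self.task_embed.weight
        q = self.decoder(q, pixel_feats)                        # (B, Q, d_model)
        class_logits = self.class_head(q)                       # which plane each query predicts
        mask_logits = torch.einsum("bqc,bpc->bqp", self.mask_embed(q), pixel_feats)
        return class_logits, mask_logits                        # per-query plane masks
```

At inference, the per‑query masks and class scores would be merged into a single layout segmentation, much as OneFormer merges query masks into its panoptic output.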
Methodology
- Backbone & Query Design – The model builds on OneFormer’s transformer encoder‑decoder. A set of task‑conditioned queries is injected, each specialized to predict a particular geometric primitive (wall, floor, or ceiling).
- Layout Degeneration – During training, ground‑truth layouts are transformed (e.g., wall removal, corner perturbation) in a way that respects Manhattan‑world orthogonality, yielding a richer set of “hard” examples without breaking the underlying geometry (see the augmentation sketch after this list).
- Geometric Losses – two differentiable terms (sketched after this list):
  - Planar Consistency Loss: encourages points belonging to the same planar surface (wall, floor, or ceiling) to have similar normal vectors.
  - Sharp Boundary Loss: penalizes blurry transitions between adjacent planes, driving the network toward crisp edge predictions.
- End‑to‑End Training – All components are differentiable, so the model learns to output a full layout map directly from the image, bypassing any separate line‑detection or clustering steps.
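To make the layout‑degeneration idea concrete, here is a small sketch of topology‑aware corner perturbation and wall removal in a Manhattan‑aligned frame. The corner representation, jitter scale, and drop probability are assumptions for illustration; the paper’s augmentation is described here only at the level of the bullet above.

```python
# Illustrative sketch of topology-aware layout "degeneration"
# (corner representation, jitter scale, and drop probability are assumptions).
import numpy as np

def degenerate_layout(corners, axes, jitter=0.05, drop_prob=0.2, rng=None):
    """corners: (N, 3) wall-junction positions in a Manhattan-aligned frame.
    axes:    (N,) index of the floor-plan axis (0=x, 1=y) along which each junction
             may slide, so perturbations keep walls orthogonal."""
    rng = rng or np.random.default_rng()
    out = corners.copy()
    # Corner perturbation: slide each junction only along its permitted axis,
    # which preserves Manhattan-world orthogonality.
    out[np.arange(len(out)), axes] += rng.normal(0.0, jitter, size=len(out))
    # Wall removal: occasionally drop a junction pair to simulate a simpler room
    # topology (e.g., a wall hidden by the camera frustum or by occluders).
    if len(out) > 4 and rng.random() < drop_prob:
        i = rng.integers(0, len(out) - 1)
        out = np.delete(out, [i, i + 1], axis=0)
    return out
```

Constraining each junction to slide along a single floor‑plan axis is what keeps the perturbed layout orthogonal; deleting a junction pair approximates removing one wall segment.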
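The two geometric losses can likewise be sketched in PyTorch. The terms below capture the stated intent (shared normals within a plane, crisp transitions between planes), but the inputs, weighting, and exact formulations are assumptions rather than the paper’s equations:

```python
# Sketch of the two geometric loss ideas (not the authors' exact formulation).
import torch
import torch.nn.functional as F

def planar_consistency_loss(normals, plane_masks, eps=1e-6):
    """normals:     (B, 3, H, W) predicted per-pixel surface normals.
    plane_masks: (B, K, H, W) soft masks, one per planar region (walls/floor/ceiling).
    Pixels assigned to the same plane should share one normal direction."""
    loss = 0.0
    for k in range(plane_masks.shape[1]):
        w = plane_masks[:, k:k + 1]                                   # (B, 1, H, W)
        mean_n = (normals * w).sum(dim=(2, 3), keepdim=True)
        mean_n = F.normalize(mean_n / (w.sum(dim=(2, 3), keepdim=True) + eps), dim=1)
        # 1 - cosine similarity between each pixel's normal and the plane's mean normal.
        loss = loss + (w * (1 - (normals * mean_n).sum(dim=1, keepdim=True))).mean()
    return loss / plane_masks.shape[1]

def sharp_boundary_loss(seg_logits, boundary_mask, eps=1e-6):
    """seg_logits: (B, K, H, W) per-plane logits; boundary_mask: (B, 1, H, W), 1 near GT plane edges.
    Low per-pixel entropy near edges means a hard, crisp transition between planes."""
    p = seg_logits.softmax(dim=1)
    entropy = -(p * (p + eps).log()).sum(dim=1, keepdim=True)         # (B, 1, H, W)
    return (entropy * boundary_mask).sum() / (boundary_mask.sum() + eps)
```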
Results & Findings
| Dataset | Pixel Error (PE) | Corner Error (CE) |
|---|---|---|
| LSUN | 5.43 % | 4.02 % |
| Hedau | 7.04 % | 5.17 % |
| Matterport3D‑Layout | 4.03 % | 3.15 % |
- The model consistently outperforms prior methods, reducing absolute pixel and corner error by roughly 0.5–2 percentage points.
- Qualitative visualizations show cleaner, more orthogonal wall boundaries and fewer spurious artifacts.
- Inference speed (≈114 ms) is roughly 2–3× faster than the previous best real‑time approaches, making it viable for on‑device AR scenarios.
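For readers unfamiliar with the two metrics: pixel error is conventionally the fraction of pixels whose predicted layout‑surface label disagrees with the ground truth, and corner error is the mean distance between matched predicted and ground‑truth corners, normalized by the image diagonal. A small sketch of these conventional definitions (not code from the paper) follows:

```python
# Standard layout metrics as commonly defined on LSUN/Hedau-style benchmarks
# (a sketch of the usual definitions, not code released with the paper).
import numpy as np

def pixel_error(pred_labels, gt_labels):
    """Fraction of pixels whose layout-surface label (wall/floor/ceiling id) is wrong."""
    return float((pred_labels != gt_labels).mean())

def corner_error(pred_corners, gt_corners, image_hw):
    """Mean distance between predicted and GT corners (assumed already matched 1:1),
    normalized by the image diagonal."""
    h, w = image_hw
    d = np.linalg.norm(pred_corners - gt_corners, axis=1)
    return float(d.mean() / np.hypot(h, w))
```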
Practical Implications
- Augmented Reality & Indoor Navigation – Developers can integrate the model into mobile AR apps to instantly generate room geometry for object placement, occlusion handling, or path planning.
- 3‑D Reconstruction Pipelines – The fast, accurate layout maps serve as strong priors for multi‑view or LiDAR‑augmented reconstruction, reducing the need for dense point‑cloud processing.
- Robotics & Scene Understanding – Service robots can use the layout predictions to infer traversable space and obstacle locations without expensive SLAM back‑ends.
- Content Creation – Interior design tools can auto‑generate floor plans from photos, accelerating the workflow for architects and real‑estate platforms.
Because the system is a single transformer model, it can be exported to ONNX or TensorRT and run on edge GPUs, opening the door to low‑latency, on‑device deployment.
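Export itself is standard PyTorch tooling rather than anything specific to this paper. The snippet below uses a tiny stand‑in module so it runs as written; in practice `model` would be the trained layout network, and the file names, input resolution, and opset are placeholders:

```python
# Hypothetical ONNX export of a layout model (names, sizes, and opset are placeholders).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 4, 3, padding=1), nn.Softmax(dim=1)).eval()  # stand-in network
dummy = torch.randn(1, 3, 512, 512)        # assumed input resolution
torch.onnx.export(
    model, dummy, "layout_anything.onnx",
    input_names=["image"], output_names=["layout"],
    opset_version=17,
    dynamic_axes={"image": {0: "batch"}, "layout": {0: "batch"}},
)
# The exported graph can then be compiled with TensorRT (e.g. `trtexec --onnx=layout_anything.onnx`)
# for low-latency inference on edge GPUs.
```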
Limitations & Future Work
- Manhattan‑World Assumption – The current design assumes orthogonal walls; highly irregular or curved interiors may degrade performance.
- Single‑Image Input – While efficient, relying on one RGB frame limits depth perception; integrating depth or multi‑view cues could boost accuracy in cluttered scenes.
- Generalization to Outdoor/Hybrid Spaces – The model is trained on indoor datasets; extending it to mixed indoor‑outdoor environments would require additional data and possibly architectural tweaks.
Future research directions include relaxing the Manhattan constraint via learned priors, incorporating depth sensors for richer geometry, and exploring lightweight transformer variants for ultra‑low‑power devices.
Authors
- Md Sohag Mia
- Muhammad Abdullah Adnan
Paper Information
- arXiv ID: 2512.02952v1
- Categories: cs.CV
- Published: December 2, 2025