[Paper] LayerGS: Decomposition and Inpainting of Layered 3D Human Avatars via 2D Gaussian Splatting
Source: arXiv - 2601.05853v1
Overview
A new framework called LayerGS lets you turn a single video of a person into a fully animatable, multi‑layer 3D avatar—separating the body from each garment. By representing each layer with 2‑D Gaussian splats and using a diffusion model to “paint in” hidden parts, the system produces photorealistic renderings that stay consistent across novel poses and viewpoints, opening the door to realistic virtual try‑on and immersive avatar creation.
Key Contributions
- Layer‑wise Gaussian Splatting: Encodes the body and each clothing item as independent collections of 2‑D Gaussians, preserving fine geometry while keeping rendering fast and memory‑efficient.
- Diffusion‑based Inpainting: Leverages a pretrained 2‑D diffusion model (via Score‑Distillation Sampling) to fill occluded garment regions that are never seen in the input video.
- Three‑Stage Training Pipeline:
- Coarse canonical garment reconstruction (single‑layer).
- Joint multi‑layer optimization that refines both body and outer‑layer details.
- Final fine‑tuning with diffusion‑driven inpainting.
- State‑of‑the‑Art Results: Outperforms prior single‑layer and multi‑layer methods on the 4D‑Dress and THuman2.0 benchmarks in both visual quality and quantitative decomposition metrics.
- Open‑Source Release: Full code and pretrained models are publicly available, encouraging rapid adoption and further research.
Methodology
- Data Capture: A short video of a person in arbitrary poses is processed to extract multi‑view images and a rough canonical pose.
- Gaussian Splatting per Layer:
- Each layer (body, shirt, pants, etc.) is modeled as a set of 2‑D Gaussian primitives placed in 3‑D space.
- Gaussians are lightweight to render (each pixel is an alpha-blended composite of the splats that cover it), yet they can capture high-frequency surface detail when densely sampled; a minimal data-structure and compositing sketch appears after this list.
- Stage‑1: Coarse Single‑Layer Reconstruction
- A vanilla Gaussian‑splatting pipeline builds a rough “canonical garment” reconstruction, providing initial geometry for the outermost clothing layer.
- Stage‑2: Multi‑Layer Joint Optimization
- The body layer and outer garment layers are simultaneously optimized.
- Photometric losses (color and silhouette) are back-propagated through a differentiable renderer, while inter-layer consistency terms discourage interpenetration between layers (see the loss sketch after this list).
- Stage‑3: Diffusion‑Driven Inpainting
- Hidden garment regions (e.g., the back of a shirt never seen) are filled using a pretrained 2‑D diffusion model.
- Score‑Distillation Sampling (SDS) treats the diffusion model as a loss function, nudging the Gaussian parameters toward textures that the diffusion model deems plausible (an SDS sketch follows this list).
- Animation & Re‑posing: The canonical layers are rigged with a standard skeletal skinning pipeline, allowing the avatar to be posed arbitrarily while preserving the learned layer separation (a minimal skinning sketch is included below).
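To make the layer-wise representation concrete, here is a minimal sketch (not the authors' code) of one avatar layer stored as its own set of 2-D Gaussian primitives, together with the front-to-back alpha compositing that turns depth-sorted splats into a pixel color. The field names, shapes, and view-independent colors are illustrative assumptions rather than the paper's actual data layout.

```python
# A minimal sketch (not the authors' code): one layer of 2-D Gaussians and
# front-to-back alpha compositing of the depth-sorted splats covering a pixel.
from dataclasses import dataclass
import torch

@dataclass
class GaussianLayer:
    """One avatar layer (body, shirt, pants, ...) as 2-D Gaussian splats in 3-D space."""
    means: torch.Tensor      # (N, 3) center of each splat in canonical space
    tangents: torch.Tensor   # (N, 2, 3) two axes spanning each 2-D disc
    scales: torch.Tensor     # (N, 2) extent along each tangent axis
    colors: torch.Tensor     # (N, 3) per-splat RGB (view-independent in this sketch)
    opacities: torch.Tensor  # (N,) per-splat opacity in [0, 1]

def composite_pixel(colors, alphas, depths):
    """Alpha-composite the K splats that cover one pixel, nearest splat first.

    colors: (K, 3) RGB, alphas: (K,) effective opacities, depths: (K,) splat depths.
    """
    order = torch.argsort(depths)                       # sort front to back
    colors, alphas = colors[order], alphas[order]
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0)
    weights = alphas * transmittance                    # contribution of each splat
    return (weights[:, None] * colors).sum(dim=0)       # (3,) composited RGB

# Toy usage: three splats over one pixel; the nearest (depth 0.2) dominates.
rgb = composite_pixel(
    colors=torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]),
    alphas=torch.tensor([0.8, 0.5, 0.9]),
    depths=torch.tensor([0.4, 0.2, 0.9]))
```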
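Stage-2's joint objective can be sketched along the same lines: photometric and silhouette losses on the composited render, plus a term that discourages body samples from poking through the outer garment. The specific loss terms, the signed-distance stand-in `garment_sdf`, and all weights are assumptions for illustration; the paper's actual formulation may differ.

```python
# A minimal sketch of a Stage-2-style joint objective. The loss terms, weights,
# and the `garment_sdf` callable are illustrative assumptions, not the paper's.
import torch
import torch.nn.functional as F

def joint_layer_loss(rendered_rgb, gt_rgb, rendered_mask, gt_mask,
                     body_points, garment_sdf, margin=0.0,
                     w_rgb=1.0, w_sil=0.5, w_pen=0.1):
    """rendered_rgb / gt_rgb: (H, W, 3); rendered_mask / gt_mask: (H, W) in [0, 1];
    body_points: (N, 3) samples on the inner (body) layer;
    garment_sdf: maps (N, 3) -> (N,) signed distance to the outer layer,
                 positive beneath the garment, negative where a point pokes outside."""
    loss_rgb = F.l1_loss(rendered_rgb, gt_rgb)                               # color term
    loss_sil = F.binary_cross_entropy(rendered_mask.clamp(1e-4, 1 - 1e-4),   # silhouette term
                                      gt_mask)
    sdf = garment_sdf(body_points)
    loss_pen = F.relu(margin - sdf).mean()   # penalize body samples outside the garment
    return w_rgb * loss_rgb + w_sil * loss_sil + w_pen * loss_pen

# Toy check with random data and a trivial SDF (all body samples safely covered).
H = W = 8
loss = joint_layer_loss(
    rendered_rgb=torch.rand(H, W, 3), gt_rgb=torch.rand(H, W, 3),
    rendered_mask=torch.rand(H, W), gt_mask=(torch.rand(H, W) > 0.5).float(),
    body_points=torch.randn(16, 3),
    garment_sdf=lambda p: torch.full((p.shape[0],), 0.05))
```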
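For Stage-3, a Score-Distillation Sampling step can be sketched as follows: render a view of the occluded region, perturb it with noise at a random diffusion timestep, ask the frozen diffusion model to predict that noise, and push the difference back into the Gaussian parameters. The `denoiser` callable and `alphas_cumprod` schedule are hypothetical stand-ins for whichever pretrained 2-D diffusion model the pipeline uses.

```python
# A minimal Score-Distillation Sampling (SDS) step. `denoiser` and `alphas_cumprod`
# are hypothetical stand-ins for a pretrained 2-D diffusion model and its schedule.
import torch

def sds_loss(rendered, denoiser, alphas_cumprod, guidance_weight=1.0):
    """rendered: (1, C, H, W) image rendered from the Gaussians (requires grad).
    Returns a surrogate scalar whose gradient matches the SDS update."""
    t = torch.randint(1, len(alphas_cumprod), (1,))          # random diffusion timestep
    a_bar = alphas_cumprod[t].view(1, 1, 1, 1)
    noise = torch.randn_like(rendered)
    noisy = a_bar.sqrt() * rendered + (1 - a_bar).sqrt() * noise
    with torch.no_grad():                                    # the diffusion model stays frozen
        pred_noise = denoiser(noisy, t)
    grad = guidance_weight * (pred_noise - noise)            # SDS gradient w.r.t. the render
    # Detach the gradient so autograd routes it straight through the render
    # and into whatever Gaussian parameters produced it.
    return (grad.detach() * rendered).sum()

# Toy usage with an untrained stand-in denoiser; in practice this would be a
# pretrained (typically text-conditioned) 2-D diffusion model.
render = torch.rand(1, 3, 64, 64, requires_grad=True)
schedule = torch.linspace(0.9999, 0.02, 1000)                # toy cumulative-alpha schedule
loss = sds_loss(render, denoiser=lambda x, t: torch.randn_like(x), alphas_cumprod=schedule)
loss.backward()                                              # gradients reach the render's producers
```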
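Finally, the re-posing step amounts to standard linear blend skinning of the canonical splat centers (covariances and normals would be transformed analogously). Skinning weights would typically be borrowed from a body model such as SMPL; everything below is an illustrative assumption rather than the paper's exact rigging.

```python
# A minimal linear-blend-skinning sketch for re-posing canonical splat centers.
# Skinning weights and bone transforms are illustrative; covariances/normals
# would be transformed analogously in a full pipeline.
import torch

def repose_means(canonical_means, skinning_weights, bone_transforms):
    """canonical_means: (N, 3) splat centers in the canonical pose;
    skinning_weights: (N, J) per-splat weights over J joints (rows sum to 1);
    bone_transforms: (J, 4, 4) canonical-to-posed transform per joint."""
    homo = torch.cat([canonical_means,
                      torch.ones(canonical_means.shape[0], 1)], dim=1)        # (N, 4)
    blended = torch.einsum('nj,jab->nab', skinning_weights, bone_transforms)  # (N, 4, 4)
    posed = torch.einsum('nab,nb->na', blended, homo)                         # (N, 4)
    return posed[:, :3]

# Toy usage: with identity bone transforms the splats stay where they are.
means = torch.randn(5, 3)
weights = torch.softmax(torch.randn(5, 2), dim=1)
bones = torch.eye(4).expand(2, 4, 4).clone()
assert torch.allclose(repose_means(means, weights, bones), means, atol=1e-5)
```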
Results & Findings
- Visual Fidelity: Rendered avatars show crisp edges, realistic fabric shading, and accurate inter‑layer occlusion, even from extreme viewpoints.
- Quantitative Gains: On 4D‑Dress, LayerGS improves PSNR by ~1.2 dB and reduces LPIPS by ~15 % compared to the previous best multi‑layer method.
- Robust Occlusion Handling: The diffusion‑inpainting step successfully reconstructs unseen garment parts, verified by a user study where participants could not reliably tell whether a region was captured or synthesized.
- Real‑Time Rendering: Thanks to the Gaussian splat representation, interactive frame rates (>30 fps) are achievable on a modern GPU, making the approach practical for live applications.
Practical Implications
- Virtual Try‑On & E‑Commerce: Brands can generate a reusable 3‑D model of a customer’s body and overlay any number of clothing layers, enabling realistic fit previews without needing a full body scan.
- Game & Metaverse Avatars: Developers can create high‑quality, animatable avatars from a short video, reducing the cost and time of asset production while keeping the flexibility to swap outfits on the fly.
- AR/VR Content Creation: The lightweight Gaussian representation fits well with mobile and headset GPUs, allowing on‑device avatar rendering for immersive experiences.
- Digital Twins & Simulation: Accurate separation of body and garments opens up physics‑based simulation (e.g., cloth draping) on top of a static body mesh without re‑training the whole model.
Limitations & Future Work
- Dependence on Diffusion Model Quality: Inpainting quality is bounded by the pretrained diffusion model’s training data; exotic fabrics or patterns may be rendered inaccurately.
- Single‑Person Capture: The current pipeline assumes a single subject per video; extending to multi‑person scenes would require additional segmentation handling.
- Rigidity of Gaussian Density: While efficient, Gaussians can struggle with extremely fine details (e.g., lace) compared to mesh‑based representations.
- Future Directions: The authors suggest integrating a learnable clothing physics layer, exploring multi‑person decomposition, and fine‑tuning diffusion models on domain‑specific garment datasets to improve texture realism.
Authors
- Yinghan Xu
- John Dingliana
Paper Information
- arXiv ID: 2601.05853v1
- Categories: cs.CV, cs.AI, cs.GR
- Published: January 9, 2026