[Paper] LayerGS: Decomposition and Inpainting of Layered 3D Human Avatars via 2D Gaussian Splatting
Source: arXiv - 2601.05853v1
Overview
A new framework called LayerGS lets you turn a single video of a person into a fully animatable, multi‑layer 3D avatar—separating the body from each garment. By representing each layer with 2‑D Gaussian splats and using a diffusion model to “paint in” hidden parts, the system produces photorealistic renderings that stay consistent across novel poses and viewpoints, opening the door to realistic virtual try‑on and immersive avatar creation.
Key Contributions
- Layer‑wise Gaussian Splatting: Encodes the body and each clothing item as independent collections of 2‑D Gaussians, preserving fine geometry while keeping rendering fast and memory‑efficient.
- Diffusion‑based Inpainting: Leverages a pretrained 2‑D diffusion model (via Score‑Distillation Sampling) to fill occluded garment regions that are never seen in the input video.
- Three‑Stage Training Pipeline:
- Coarse canonical garment reconstruction (single‑layer).
- Joint multi‑layer optimization that refines both body and outer‑layer details.
- Final fine‑tuning with diffusion‑driven inpainting.
- State‑of‑the‑Art Results: Outperforms prior single‑layer and multi‑layer methods on the 4D‑Dress and THuman2.0 benchmarks in both visual quality and quantitative decomposition metrics.
- Open‑Source Release: Full code and pretrained models are publicly available, encouraging rapid adoption and further research.
Methodology
- Data Capture: A short video of a person in arbitrary poses is processed to extract multi‑view images and a rough canonical pose.
- Gaussian Splatting per Layer:
- Each layer (body, shirt, pants, etc.) is modeled as a set of 2‑D Gaussian primitives placed in 3‑D space.
- Gaussians are lightweight to render (each pixel is an alpha-blended composite of the splats that cover it), yet they can capture high-frequency surface detail when densely sampled; a minimal data-structure and compositing sketch appears after this list.
- Stage‑1: Coarse Single‑Layer Reconstruction
- A vanilla Gaussian‑splatting pipeline builds a rough “canonical garment” reconstruction, providing initial geometry for the outermost clothing layer.
- Stage‑2: Multi‑Layer Joint Optimization
- The body layer and outer garment layers are simultaneously optimized.
- Photometric losses (color and silhouette) are back-propagated through a differentiable renderer, while inter-layer consistency terms discourage interpenetration between layers (see the loss sketch after this list).
- Stage‑3: Diffusion‑Driven Inpainting
- Hidden garment regions (e.g., the back of a shirt never seen) are filled using a pretrained 2‑D diffusion model.
- Score‑Distillation Sampling (SDS) treats the diffusion model as a loss function, nudging the Gaussian parameters toward textures that the diffusion model deems plausible (an SDS sketch follows this list).
- Animation & Re‑posing: The canonical layers are rigged with a standard skeletal skinning pipeline, allowing the avatar to be posed arbitrarily while preserving the learned layer separation (a minimal skinning sketch is included below).
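To make the layer-wise representation concrete, here is a minimal sketch (not the authors' code) of one avatar layer stored as its own set of 2-D Gaussian primitives, together with the front-to-back alpha compositing that turns depth-sorted splats into a pixel color. The field names, shapes, and view-independent colors are illustrative assumptions rather than the paper's actual data layout.

```python
# A minimal sketch (not the authors' code): one layer of 2-D Gaussians and
# front-to-back alpha compositing of the depth-sorted splats covering a pixel.
from dataclasses import dataclass
import torch

@dataclass
class GaussianLayer:
    """One avatar layer (body, shirt, pants, ...) as 2-D Gaussian splats in 3-D space."""
    means: torch.Tensor      # (N, 3) center of each splat in canonical space
    tangents: torch.Tensor   # (N, 2, 3) two axes spanning each 2-D disc
    scales: torch.Tensor     # (N, 2) extent along each tangent axis
    colors: torch.Tensor     # (N, 3) per-splat RGB (view-independent in this sketch)
    opacities: torch.Tensor  # (N,) per-splat opacity in [0, 1]

def composite_pixel(colors, alphas, depths):
    """Alpha-composite the K splats that cover one pixel, nearest splat first.

    colors: (K, 3) RGB, alphas: (K,) effective opacities, depths: (K,) splat depths.
    """
    order = torch.argsort(depths)                       # sort front to back
    colors, alphas = colors[order], alphas[order]
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0)
    weights = alphas * transmittance                    # contribution of each splat
    return (weights[:, None] * colors).sum(dim=0)       # (3,) composited RGB

# Toy usage: three splats over one pixel; the nearest (depth 0.2) dominates.
rgb = composite_pixel(
    colors=torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]),
    alphas=torch.tensor([0.8, 0.5, 0.9]),
    depths=torch.tensor([0.4, 0.2, 0.9]))
```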
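Stage-2's joint objective can be sketched along the same lines: photometric and silhouette losses on the composited render, plus a term that discourages body samples from poking through the outer garment. The specific loss terms, the signed-distance stand-in `garment_sdf`, and all weights are assumptions for illustration; the paper's actual formulation may differ.

```python
# A minimal sketch of a Stage-2-style joint objective. The loss terms, weights,
# and the `garment_sdf` callable are illustrative assumptions, not the paper's.
import torch
import torch.nn.functional as F

def joint_layer_loss(rendered_rgb, gt_rgb, rendered_mask, gt_mask,
                     body_points, garment_sdf, margin=0.0,
                     w_rgb=1.0, w_sil=0.5, w_pen=0.1):
    """rendered_rgb / gt_rgb: (H, W, 3); rendered_mask / gt_mask: (H, W) in [0, 1];
    body_points: (N, 3) samples on the inner (body) layer;
    garment_sdf: maps (N, 3) -> (N,) signed distance to the outer layer,
                 positive beneath the garment, negative where a point pokes outside."""
    loss_rgb = F.l1_loss(rendered_rgb, gt_rgb)                               # color term
    loss_sil = F.binary_cross_entropy(rendered_mask.clamp(1e-4, 1 - 1e-4),   # silhouette term
                                      gt_mask)
    sdf = garment_sdf(body_points)
    loss_pen = F.relu(margin - sdf).mean()   # penalize body samples outside the garment
    return w_rgb * loss_rgb + w_sil * loss_sil + w_pen * loss_pen

# Toy check with random data and a trivial SDF (all body samples safely covered).
H = W = 8
loss = joint_layer_loss(
    rendered_rgb=torch.rand(H, W, 3), gt_rgb=torch.rand(H, W, 3),
    rendered_mask=torch.rand(H, W), gt_mask=(torch.rand(H, W) > 0.5).float(),
    body_points=torch.randn(16, 3),
    garment_sdf=lambda p: torch.full((p.shape[0],), 0.05))
```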
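For Stage-3, a Score-Distillation Sampling step can be sketched as follows: render a view of the occluded region, perturb it with noise at a random diffusion timestep, ask the frozen diffusion model to predict that noise, and push the difference back into the Gaussian parameters. The `denoiser` callable and `alphas_cumprod` schedule are hypothetical stand-ins for whichever pretrained 2-D diffusion model the pipeline uses.

```python
# A minimal Score-Distillation Sampling (SDS) step. `denoiser` and `alphas_cumprod`
# are hypothetical stand-ins for a pretrained 2-D diffusion model and its schedule.
import torch

def sds_loss(rendered, denoiser, alphas_cumprod, guidance_weight=1.0):
    """rendered: (1, C, H, W) image rendered from the Gaussians (requires grad).
    Returns a surrogate scalar whose gradient matches the SDS update."""
    t = torch.randint(1, len(alphas_cumprod), (1,))          # random diffusion timestep
    a_bar = alphas_cumprod[t].view(1, 1, 1, 1)
    noise = torch.randn_like(rendered)
    noisy = a_bar.sqrt() * rendered + (1 - a_bar).sqrt() * noise
    with torch.no_grad():                                    # the diffusion model stays frozen
        pred_noise = denoiser(noisy, t)
    grad = guidance_weight * (pred_noise - noise)            # SDS gradient w.r.t. the render
    # Detach the gradient so autograd routes it straight through the render
    # and into whatever Gaussian parameters produced it.
    return (grad.detach() * rendered).sum()

# Toy usage with an untrained stand-in denoiser; in practice this would be a
# pretrained (typically text-conditioned) 2-D diffusion model.
render = torch.rand(1, 3, 64, 64, requires_grad=True)
schedule = torch.linspace(0.9999, 0.02, 1000)                # toy cumulative-alpha schedule
loss = sds_loss(render, denoiser=lambda x, t: torch.randn_like(x), alphas_cumprod=schedule)
loss.backward()                                              # gradients reach the render's producers
```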
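Finally, the re-posing step amounts to standard linear blend skinning of the canonical splat centers (covariances and normals would be transformed analogously). Skinning weights would typically be borrowed from a body model such as SMPL; everything below is an illustrative assumption rather than the paper's exact rigging.

```python
# A minimal linear-blend-skinning sketch for re-posing canonical splat centers.
# Skinning weights and bone transforms are illustrative; covariances/normals
# would be transformed analogously in a full pipeline.
import torch

def repose_means(canonical_means, skinning_weights, bone_transforms):
    """canonical_means: (N, 3) splat centers in the canonical pose;
    skinning_weights: (N, J) per-splat weights over J joints (rows sum to 1);
    bone_transforms: (J, 4, 4) canonical-to-posed transform per joint."""
    homo = torch.cat([canonical_means,
                      torch.ones(canonical_means.shape[0], 1)], dim=1)        # (N, 4)
    blended = torch.einsum('nj,jab->nab', skinning_weights, bone_transforms)  # (N, 4, 4)
    posed = torch.einsum('nab,nb->na', blended, homo)                         # (N, 4)
    return posed[:, :3]

# Toy usage: with identity bone transforms the splats stay where they are.
means = torch.randn(5, 3)
weights = torch.softmax(torch.randn(5, 2), dim=1)
bones = torch.eye(4).expand(2, 4, 4).clone()
assert torch.allclose(repose_means(means, weights, bones), means, atol=1e-5)
```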
Results & Findings
- Visual Fidelity: Rendered avatars show crisp edges, realistic fabric shading, and accurate inter‑layer occlusion, even from extreme viewpoints.
- Quantitative Gains: On 4D‑Dress, LayerGS improves PSNR by ~1.2 dB and reduces LPIPS by ~15 % compared to the previous best multi‑layer method.
- Robust Occlusion Handling: The diffusion‑inpainting step successfully reconstructs unseen garment parts, verified by a user study where participants could not reliably tell whether a region was captured or synthesized.
- Real‑Time Rendering: Thanks to the Gaussian splat representation, interactive frame rates (>30 fps) are achievable on a modern GPU, making the approach practical for live applications.
Practical Implications
- Virtual Try‑On & E‑Commerce: Brands can generate a reusable 3‑D model of a customer’s body and overlay any number of clothing layers, enabling realistic fit previews without needing a full body scan.
- Game & Metaverse Avatars: Developers can create high‑quality, animatable avatars from a short video, reducing the cost and time of asset production while keeping the flexibility to swap outfits on the fly.
- AR/VR Content Creation: The lightweight Gaussian representation fits well with mobile and headset GPUs, allowing on‑device avatar rendering for immersive experiences.
- Digital Twins & Simulation: Accurate separation of body and garments opens up physics‑based simulation (e.g., cloth draping) on top of a static body mesh without re‑training the whole model.
Limitations & Future Work
- Dependence on Diffusion Model Quality: Inpainting quality is bounded by the pretrained diffusion model’s training data; exotic fabrics or patterns may be rendered inaccurately.
- Single‑Person Capture: The current pipeline assumes a single subject per video; extending to multi‑person scenes would require additional segmentation handling.
- Rigidity of Gaussian Density: While efficient, Gaussians can struggle with extremely fine details (e.g., lace) compared to mesh‑based representations.
- Future Directions: The authors suggest integrating a learnable clothing physics layer, exploring multi‑person decomposition, and fine‑tuning diffusion models on domain‑specific garment datasets to improve texture realism.
Authors
- Yinghan Xu
- John Dingliana
Paper Information
- arXiv ID: 2601.05853v1
- Categories: cs.CV, cs.AI, cs.GR
- Published: January 9, 2026