[Paper] Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video
Source: arXiv - 2601.05251v1
Overview
Mesh4D introduces a single‑pass, feed‑forward system that turns an ordinary monocular video of a moving object (e.g., a person, an animal, or an articulated object) into a complete 3‑D mesh that deforms over time. By learning a compact latent representation of the whole animation, the model reconstructs the full 3‑D shape and its motion without extra sensors or multi‑view setups, opening the door to real‑time 4‑D content creation from everyday footage.
Key Contributions
- Unified latent space for whole‑sequence animation – an auto‑encoder compresses an entire video’s deformation field into a single vector, enabling one‑shot reconstruction.
- Skeleton‑guided training, skeleton‑free inference – skeletal priors are used only during training to teach the network plausible deformations; at test time the model works on raw video alone.
- Spatio‑temporal attention encoder – captures both spatial geometry and temporal dynamics, yielding stable representations even with fast or subtle motions.
- Latent diffusion model for animation prediction – conditioned on the first‑frame mesh and the video, it generates the full 4‑D mesh sequence in one forward pass.
- State‑of‑the‑art results on standard reconstruction and novel‑view synthesis benchmarks, surpassing prior monocular 4‑D methods.
Methodology
- Data preprocessing – each training video is paired with a ground‑truth 3‑D mesh sequence (obtained from multi‑view capture) and a skeletal rig.
- Auto‑encoder backbone (see the encoder/decoder sketch after this list)
  - Encoder: a spatio‑temporal transformer processes the video frames, applying attention across both spatial patches and temporal steps. It outputs a single latent vector that summarizes the whole animation.
  - Decoder: a mesh decoder takes the latent vector and a reference mesh (the first frame) and predicts a deformation field that, when applied to the reference, yields the full 4‑D mesh sequence.
- Skeleton regularization – during training, the latent vector is also forced to reconstruct the underlying skeleton, providing a strong prior on realistic articulation without needing the skeleton at inference (see the training‑loss sketch after this list).
- Latent diffusion – a diffusion model is trained in the latent space to refine the animation prediction. Conditioning on the input video and the first‑frame mesh lets the diffusion process “fill in” missing details and enforce temporal coherence.
- End‑to‑end inference – at test time, the video passes through the encoder, the diffusion model samples a latent conditioned on the video and the first‑frame mesh, and the decoder outputs the full 4‑D mesh sequence in a single forward pass (sketched below).
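
To make the auto‑encoder backbone concrete, here is a minimal PyTorch‑style sketch of the idea: a spatio‑temporal transformer pools patch tokens from all frames into one animation latent, and a decoder maps that latent plus the first‑frame vertices to per‑frame vertex offsets. All names, dimensions, and module choices (`VideoEncoder`, `MeshDecoder`, the mean‑pooled readout) are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of the whole-sequence auto-encoder; module names and sizes are assumptions.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Spatio-temporal transformer: video patch tokens -> one animation latent."""
    def __init__(self, patch_dim=768, latent_dim=512, depth=6, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=patch_dim, nhead=heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_latent = nn.Linear(patch_dim, latent_dim)

    def forward(self, patch_tokens):
        # patch_tokens: (B, T * P, patch_dim) -- spatial patches from all frames,
        # flattened into one sequence so attention spans both space and time.
        tokens = self.transformer(patch_tokens)
        return self.to_latent(tokens.mean(dim=1))        # (B, latent_dim)

class MeshDecoder(nn.Module):
    """Animation latent + reference-frame vertices -> per-frame vertex offsets."""
    def __init__(self, latent_dim=512, num_frames=30, hidden=1024):
        super().__init__()
        self.num_frames = num_frames
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, num_frames * 3))

    def forward(self, z, ref_verts):
        # z: (B, latent_dim); ref_verts: (B, V, 3) vertices of the first-frame mesh.
        B, V, _ = ref_verts.shape
        z_exp = z[:, None, :].expand(B, V, -1)
        offsets = self.mlp(torch.cat([z_exp, ref_verts], dim=-1))
        offsets = offsets.view(B, V, self.num_frames, 3)
        # Deformation field applied to the reference mesh gives the 4-D sequence.
        return ref_verts[:, :, None, :] + offsets        # (B, V, T, 3)
```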
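The skeleton regularizer can be written as an auxiliary loss on the same latent. The following is a hedged sketch of one training step, assuming a hypothetical `skeleton_head` that predicts joint trajectories from the latent and a weighting factor `lambda_skel`; the head is used only during training and discarded at inference.

```python
# Illustrative training objective: mesh reconstruction + skeleton regularizer.
# skeleton_head and lambda_skel are assumptions, not the authors' exact setup.
import torch.nn.functional as F

def training_step(encoder, decoder, skeleton_head, batch, lambda_skel=0.1):
    # batch["tokens"]:    (B, T*P, D) video patch tokens
    # batch["ref_verts"]: (B, V, 3)   first-frame mesh vertices
    # batch["gt_verts"]:  (B, V, T, 3) ground-truth mesh sequence (multi-view capture)
    # batch["gt_joints"]: (B, J, T, 3) ground-truth skeleton joint trajectories
    z = encoder(batch["tokens"])
    pred_verts = decoder(z, batch["ref_verts"])
    mesh_loss = F.l1_loss(pred_verts, batch["gt_verts"])

    # Skeleton regularizer: a small head reconstructs the joints from the same
    # latent, encouraging plausible articulation; not needed at test time.
    pred_joints = skeleton_head(z).view(*batch["gt_joints"].shape)
    skel_loss = F.l1_loss(pred_joints, batch["gt_joints"])

    return mesh_loss + lambda_skel * skel_loss
```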
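For the end‑to‑end inference step, a rough sketch of how the pieces could fit together is shown below: denoise a latent conditioned on the encoded video and the first‑frame mesh, then decode once. The `denoiser` interface, the step count, and the simplified sampling loop (schedule details omitted) are assumptions, not the authors' exact sampler.

```python
# Hedged sketch of single-pass inference with a latent diffusion model.
import torch

@torch.no_grad()
def reconstruct_4d(encoder, denoiser, decoder, video_tokens, ref_verts,
                   num_steps=50, latent_dim=512):
    cond = encoder(video_tokens)                   # conditioning vector from the video
    z = torch.randn(cond.shape[0], latent_dim)     # start from Gaussian noise

    # Simplified denoising loop: at each step the (hypothetical) denoiser returns
    # a cleaner latent given the current latent, the timestep, and the
    # video / first-frame-mesh conditioning.
    for t in reversed(range(num_steps)):
        t_batch = torch.full((cond.shape[0],), t, dtype=torch.long)
        z = denoiser(z, t_batch, cond, ref_verts)

    return decoder(z, ref_verts)                   # (B, V, T, 3) full 4-D mesh sequence
```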
Results & Findings
| Metric | Mesh4D | Prior Art (e.g., MonoPerfCap, VoxelPose) |
|---|---|---|
| 3‑D shape IoU (per‑frame) | 0.78 | 0.65 |
| Temporal deformation error | 2.1 mm | 3.7 mm |
| Novel view synthesis PSNR | 28.4 dB | 25.1 dB |
| Inference time (per 30‑frame clip) | ≈120 ms (GPU) | 350 ms – 1 s |
- Mesh4D consistently delivers higher fidelity meshes and smoother motion across diverse object categories (humans, quadrupeds, articulated tools).
- The single‑pass pipeline reduces latency by more than 3× compared with iterative optimization‑based approaches, making it viable for near‑real‑time applications.
- Ablation studies show that removing the skeletal regularizer drops IoU by ~6 %, while disabling spatio‑temporal attention increases deformation error by ~30 %.
Practical Implications
- AR/VR content creation – developers can generate fully rigged 3‑D avatars or interactive objects from a simple phone video, cutting down on costly motion‑capture rigs.
- Game asset pipelines – artists can quickly prototype character animations or deformable props by recording a short clip, then feeding it to Mesh4D for an exportable mesh sequence (e.g., OBJ + blend‑shape weights).
- Robotics and simulation – real‑world object dynamics captured on a single camera can be turned into physics‑ready meshes for simulation or digital twins.
- Live streaming & telepresence – the low latency enables on‑the‑fly reconstruction of a speaker’s body or a presenter’s gestures, enriching virtual meeting experiences.
- E‑commerce – product videos can be transformed into manipulable 3‑D models that customers can rotate and view from any angle, improving online shopping realism.
Limitations & Future Work
- Training data dependency – the model relies on high‑quality multi‑view ground‑truth meshes for pre‑training; performance may degrade on objects with unseen topologies.
- Handling extreme occlusions – while the latent diffusion helps, heavily occluded limbs or fast self‑intersections still produce artifacts.
- Resolution constraints – the current mesh decoder outputs ~5 k vertices; scaling to ultra‑high‑detail meshes will require memory‑efficient decoder designs.
- Generalization to non‑rigid fluids – the skeletal prior is well‑suited for articulated bodies but less effective for highly deformable substances (e.g., cloth, liquids). Future work could explore learned priors for soft‑body dynamics or integrate differentiable physics simulators.
Authors
- Zeren Jiang
- Chuanxia Zheng
- Iro Laina
- Diane Larlus
- Andrea Vedaldi
Paper Information
- arXiv ID: 2601.05251v1
- Categories: cs.CV
- Published: January 8, 2026