[Paper] Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction
Source: arXiv - 2603.09930v1
Overview
This paper tackles text‑motion retrieval, the problem of finding 3D human motion clips that match a natural‑language query (and vice versa). By moving away from coarse, global embeddings and instead modeling fine‑grained joint‑level correspondences, the authors achieve higher accuracy and give developers a way to see why a particular motion was retrieved.
Key Contributions
- Joint‑Angle Motion Images: Converts per‑joint angular data into a structured pseudo‑image that can be processed by off‑the‑shelf Vision Transformers (ViT).
- Token‑Patch Late Interaction (MaxSim): A token‑wise similarity scoring scheme that matches text tokens to motion “patches” after encoding, preserving local semantics.
- Masked Language Modeling (MLM) Regularization: Encourages the text encoder to learn robust, context‑aware token representations, improving alignment with motion patches.
- Interpretability: Provides explicit heat‑maps linking words (e.g., “raise”, “left arm”) to specific joint‑angle regions, making retrieval results explainable.
- State‑of‑the‑art Performance: Sets new benchmarks on HumanML3D and KIT‑ML datasets, surpassing previous dual‑encoder methods.
Methodology
Motion Representation
- Each frame of a motion sequence is expressed as a set of joint angles (rather than raw 3D coordinates).
- Angles are arranged into a 2‑D grid where rows correspond to joints and columns to time steps, forming a motion image.
- This image is fed into a pre‑trained Vision Transformer, leveraging its strong spatial‑temporal pattern learning without training from scratch.
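The arrangement above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' code: the array shapes and the min‑max normalization are assumptions chosen so the grid resembles a single‑channel image a pre‑trained ViT could consume.

```python
import numpy as np

def make_motion_image(joint_angles: np.ndarray) -> np.ndarray:
    """Arrange per-frame joint angles into a joints-by-time 'motion image'.

    joint_angles: shape (T, J) -- T frames, each with J joint angles.
    Returns shape (J, T): rows are joints, columns are time steps,
    scaled to [0, 1] like pixel intensities.
    """
    img = joint_angles.T  # rows = joints, columns = time steps
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + 1e-8)

# Toy example: 8 frames, 4 joint angles per frame
motion = np.random.rand(8, 4)
img = make_motion_image(motion)
print(img.shape)  # (4, 8)
```

In practice the paper feeds such a grid to an off‑the‑shelf ViT, so the image would also need resizing/patching to match the ViT's expected input resolution.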
Text Encoding & Late Interaction
- A transformer‑based language model encodes the query into a sequence of token embeddings.
- Instead of collapsing the motion image into a single vector, the method keeps the ViT’s patch embeddings (each patch ≈ a joint‑time region).
- MaxSim computes the cosine similarity between every text token and every motion patch, takes the maximum over patches for each token, and averages the per‑token maxima, so each word is scored by its single best local match.
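The scoring step above can be written as a small NumPy sketch. This is an illustration of MaxSim‑style late interaction under assumed shapes (N text tokens, M motion patches, embedding dimension d), not the paper's implementation:

```python
import numpy as np

def maxsim_score(text_tokens: np.ndarray, motion_patches: np.ndarray) -> float:
    """Token-patch late interaction (MaxSim).

    text_tokens:    (N, d) text token embeddings.
    motion_patches: (M, d) ViT patch embeddings of the motion image.
    For each token, take the max cosine similarity over all patches,
    then average the per-token maxima into one retrieval score.
    """
    t = text_tokens / np.linalg.norm(text_tokens, axis=1, keepdims=True)
    p = motion_patches / np.linalg.norm(motion_patches, axis=1, keepdims=True)
    sim = t @ p.T                    # (N, M) cosine similarity matrix
    return float(sim.max(axis=1).mean())

# Toy usage: 5 tokens, 20 patches, 16-dim embeddings
tokens = np.random.rand(5, 16)
patches = np.random.rand(20, 16)
score = maxsim_score(tokens, patches)
```

Because the max is taken per token, the (N, M) similarity matrix is exactly what the paper's word‑to‑joint heat‑maps visualize.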
MLM Regularization
- During training, random words in the query are masked and the model must predict them, forcing the text encoder to capture richer contextual cues that align better with motion patches.
Training Objective
- A contrastive loss pulls matching text‑motion pairs together while pushing mismatched pairs apart, applied on the MaxSim scores.
- The MLM loss is added as an auxiliary term.
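The combined objective can be sketched as a symmetric InfoNCE loss over a batch of MaxSim scores plus a weighted MLM term. The temperature and the auxiliary weight `lam` are illustrative values, not taken from the paper:

```python
import numpy as np

def contrastive_loss(score_matrix: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch of text-motion MaxSim scores.

    score_matrix[i, j] = MaxSim score between text i and motion j;
    matching pairs lie on the diagonal.
    """
    logits = score_matrix / temperature

    def nll_diag(l):
        # Cross-entropy of the diagonal (matching) entries, row-wise
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average text->motion and motion->text directions
    return 0.5 * (nll_diag(logits) + nll_diag(logits.T))

def total_loss(score_matrix: np.ndarray, mlm_loss: float, lam: float = 0.1) -> float:
    """Contrastive loss with the MLM loss added as an auxiliary term."""
    return contrastive_loss(score_matrix) + lam * mlm_loss
```

With well‑separated diagonal scores the contrastive term approaches zero, while an all‑equal score matrix yields the uniform‑guessing loss of log(batch size).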
Results & Findings
| Dataset | Recall@1 ↑ | Recall@5 ↑ | Median Rank ↓ |
|---|---|---|---|
| HumanML3D | 45.2% (vs. 38.7% prior) | 78.1% (vs. 70.4%) | 12 (vs. 18) |
| KIT‑ML | 41.5% (vs. 34.2%) | 73.9% (vs. 66.0%) | 15 (vs. 22) |
- Fine‑grained alignment yields a 6–8% absolute gain in top‑1 recall.
- Visualizations show clear word‑to‑joint mappings (e.g., “kick” lights up the knee/ankle region).
- Ablation studies confirm that the motion‑image format and the MaxSim interaction each account for a large share of the performance boost; removing MLM regularization drops recall by ~2%.
Practical Implications
- Game Development & Animation: Artists can type “a character waves hello with the right hand” and instantly retrieve a matching motion clip, cutting down manual search time.
- VR/AR Interaction Design: Real‑time systems can map voice commands to precise avatar motions, improving responsiveness and user immersion.
- Robotics & Motion Planning: The joint‑angle image representation aligns naturally with robot joint space, enabling language‑driven motion synthesis for humanoid robots.
- Explainable AI: The heat‑map visualizations help developers debug why a retrieval succeeded or failed, fostering trust in AI‑assisted pipelines.
Limitations & Future Work
- Dataset Bias: Experiments are limited to two benchmark corpora; performance on more diverse or noisy real‑world motion capture data remains untested.
- Computation Overhead: Keeping all patch embeddings for late interaction increases memory usage compared to a single global vector.
- Temporal Granularity: The current grid treats time uniformly; actions with variable speed may need adaptive temporal pooling.
- Future Directions suggested by the authors include extending the framework to multimodal queries (e.g., text + audio), exploring hierarchical patch representations for longer sequences, and integrating diffusion‑based motion generation for end‑to‑end text‑to‑motion pipelines.
Authors
- Yao Zhang
- Zhuchenyang Liu
- Yanlan He
- Thomas Ploetz
- Yu Xiao
Paper Information
- arXiv ID: 2603.09930v1
- Categories: cs.CV, cs.IR
- Published: March 10, 2026