[Paper] Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction

Published: March 10, 2026 at 01:26 PM EDT
4 min read

Source: arXiv - 2603.09930v1

Overview

This paper tackles text‑motion retrieval: the problem of finding 3D human motion clips that match a natural‑language query (and vice versa). By moving away from coarse, global embeddings and instead modeling fine‑grained joint‑level correspondences, the authors achieve higher retrieval accuracy and give developers a way to see why a particular motion was retrieved.

Key Contributions

  • Joint‑Angle Motion Images: Converts per‑joint angular data into a structured pseudo‑image that can be processed by off‑the‑shelf Vision Transformers (ViT).
  • Token‑Patch Late Interaction (MaxSim): A token‑wise similarity scoring scheme that matches text tokens to motion “patches” after encoding, preserving local semantics.
  • Masked Language Modeling (MLM) Regularization: Encourages the text encoder to learn robust, context‑aware token representations, improving alignment with motion patches.
  • Interpretability: Provides explicit heat‑maps linking words (e.g., “raise”, “left arm”) to specific joint‑angle regions, making retrieval results explainable.
  • State‑of‑the‑art Performance: Sets new benchmarks on HumanML3D and KIT‑ML datasets, surpassing previous dual‑encoder methods.
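The joint‑angle motion image at the heart of these contributions can be sketched in a few lines of NumPy. This is a minimal illustration only; the min‑max normalization and the exact array layout are assumptions for the sketch, not the authors' preprocessing pipeline:

```python
import numpy as np

def motion_image(joint_angles):
    """Arrange per-frame joint angles into a 2-D 'motion image'.

    joint_angles: array of shape (T, J) -- T time steps, J joint angles.
    Returns an array of shape (J, T): rows = joints, columns = time,
    min-max normalized to [0, 1] so a ViT can consume it like a
    single-channel image (the normalization choice is an assumption).
    """
    img = np.asarray(joint_angles, dtype=np.float32).T  # (J, T)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + 1e-8)

# Toy example: 4 frames, 3 joints
angles = np.array([[0.0, 0.5, 1.0],
                   [0.1, 0.6, 0.9],
                   [0.2, 0.7, 0.8],
                   [0.3, 0.8, 0.7]])
img = motion_image(angles)
print(img.shape)  # (3, 4): one row per joint, one column per frame
```

A real pipeline would additionally resize or tile this grid to match the pre‑trained ViT's expected input resolution and patch size.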

Methodology

  1. Motion Representation

    • Each frame of a motion sequence is expressed as a set of joint angles (rather than raw 3D coordinates).
    • Angles are arranged into a 2‑D grid where rows correspond to joints and columns to time steps, forming a motion image.
    • This image is fed into a pre‑trained Vision Transformer, leveraging its strong spatio‑temporal pattern learning without training a motion encoder from scratch.
  2. Text Encoding & Late Interaction

    • A transformer‑based language model encodes the query into a sequence of token embeddings.
    • Instead of collapsing the motion image into a single vector, the method keeps the ViT’s patch embeddings (each patch ≈ a joint‑time region).
    • MaxSim computes the cosine similarity between every text token and every motion patch, takes the maximum over patches for each token, and averages these per‑token maxima, so each query word is scored by its single best local match.
  3. MLM Regularization

    • During training, random words in the query are masked and the model must predict them, forcing the text encoder to capture richer contextual cues that align better with motion patches.
  4. Training Objective

    • A contrastive loss pulls matching text‑motion pairs together while pushing mismatched pairs apart, applied on the MaxSim scores.
    • The MLM loss is added as an auxiliary term.
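The scoring and loss in steps 2 and 4 above can be sketched as follows. This is a hedged NumPy sketch; the function names, the temperature value, and the batch layout are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def maxsim_score(text_tokens, motion_patches):
    """Token-patch late interaction (MaxSim).

    text_tokens:    (N, d) L2-normalized text token embeddings
    motion_patches: (M, d) L2-normalized ViT patch embeddings
    For each text token, take the maximum cosine similarity over all
    motion patches, then average these maxima over the tokens.
    """
    sims = text_tokens @ motion_patches.T  # (N, M) cosine similarities
    return float(sims.max(axis=1).mean())

def contrastive_loss(scores, temperature=0.07):
    """InfoNCE-style contrastive loss on a (B, B) score matrix, where
    scores[i, j] = maxsim_score(text_i, motion_j) and the diagonal
    holds the matching text-motion pairs."""
    logits = scores / temperature
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())

# Toy check: orthonormal tokens and patches give a perfect MaxSim of 1.0
tokens, patches = np.eye(2), np.eye(2)
print(maxsim_score(tokens, patches))  # 1.0
```

In training, the MLM objective from step 3 would be added to this contrastive loss as a weighted auxiliary term.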

Results & Findings

| Dataset   | Recall@1 ↑              | Recall@5 ↑        | Median Rank ↓ |
|-----------|-------------------------|-------------------|---------------|
| HumanML3D | 45.2% (vs. 38.7% prior) | 78.1% (vs. 70.4%) | 12 (vs. 18)   |
| KIT‑ML    | 41.5% (vs. 34.2%)       | 73.9% (vs. 66.0%) | 15 (vs. 22)   |
  • Fine‑grained alignment yields a 6–8% absolute gain in top‑1 recall.
  • Visualizations show clear word‑to‑joint mappings (e.g., “kick” lights up the knee/ankle region).
  • Ablation studies confirm that both the motion‑image format and the MaxSim interaction contribute most of the performance boost; removing MLM drops recall by ~2%.

Practical Implications

  • Game Development & Animation: Artists can type “a character waves hello with the right hand” and instantly retrieve a matching motion clip, cutting down manual search time.
  • VR/AR Interaction Design: Real‑time systems can map voice commands to precise avatar motions, improving responsiveness and user immersion.
  • Robotics & Motion Planning: The joint‑angle image representation aligns naturally with robot joint space, enabling language‑driven motion synthesis for humanoid robots.
  • Explainable AI: The heat‑map visualizations help developers debug why a retrieval succeeded or failed, fostering trust in AI‑assisted pipelines.

Limitations & Future Work

  • Dataset Bias: Experiments are limited to two benchmark corpora; performance on more diverse or noisy real‑world motion capture data remains untested.
  • Computation Overhead: Keeping all patch embeddings for late interaction increases memory usage compared to a single global vector.
  • Temporal Granularity: The current grid treats time uniformly; actions with variable speed may need adaptive temporal pooling.
  • Future Directions suggested by the authors include extending the framework to multimodal queries (e.g., text + audio), exploring hierarchical patch representations for longer sequences, and integrating diffusion‑based motion generation for end‑to‑end text‑to‑motion pipelines.

Authors

  • Yao Zhang
  • Zhuchenyang Liu
  • Yanlan He
  • Thomas Ploetz
  • Yu Xiao

Paper Information

  • arXiv ID: 2603.09930v1
  • Categories: cs.CV, cs.IR
  • Published: March 10, 2026
