[Paper] FineTec: Fine-Grained Action Recognition Under Temporal Corruption via Skeleton Decomposition and Sequence Completion

Published: December 31, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.25067v1

Overview

FineTec tackles a real‑world pain point for developers working with pose‑based AI: recognizing subtle, fine‑grained human actions when the input skeleton data is riddled with missing frames or noisy joints. By blending context‑aware sequence completion, a physics‑inspired motion model, and graph convolutional network (GCN) classification, the framework restores corrupted skeleton streams and extracts the delicate motion cues needed to tell apart very similar actions.

Key Contributions

  • Unified corruption‑robust pipeline – integrates temporal in‑painting, spatial decomposition, and physics‑driven dynamics into a single end‑to‑end model.
  • Context‑aware sequence completion – uses diverse temporal masking to train a completion module that can reconstruct missing joints across a wide range of corruption levels.
  • Semantic skeleton decomposition – automatically splits the human skeleton into five body regions and further into dynamic vs. static joint groups based on motion variance, enabling targeted data augmentation.
  • Lagrangian dynamics estimator – computes joint accelerations from the restored positions, providing a physics‑grounded feature that complements raw joint coordinates.
  • Joint position + acceleration GCN head – fuses spatial and dynamic cues in a graph‑convolutional network, delivering state‑of‑the‑art accuracy on both coarse‑ and fine‑grained benchmarks under severe temporal corruption.

Methodology

  1. Temporal Corruption Modeling – During training, the raw skeleton sequence is randomly masked in time (e.g., dropping whole frames or individual joint observations) to simulate the kinds of gaps produced by online pose estimators.
  2. Base Sequence Restoration – A transformer‑style encoder‑decoder learns to fill in the missing parts using surrounding context, producing a base skeleton stream that is already more complete than the raw input.
  3. Spatial Decomposition & Augmentation
    • The skeleton is partitioned into five semantic regions (head‑torso, left/right arms, left/right legs).
    • Within each region, joints are classified as dynamic (high variance) or static (low variance).
    • Two auxiliary streams are generated: one where dynamic joints are slightly perturbed (to encourage robustness) and another where static joints are perturbed (to expose hidden discriminative cues).
  4. Physics‑Driven Estimation – Leveraging Lagrangian mechanics, the model estimates joint accelerations from the three streams (base + two augmentations). This step injects a physically meaningful representation of motion that is less sensitive to missing data.
  5. GCN‑Based Recognition Head – The fused position sequence and the fused acceleration sequence are fed into a graph convolutional network that respects the natural connectivity of the human skeleton, outputting the final action class.
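The paper does not ship reference code, but the diverse temporal masking of step 1 is straightforward to sketch. The function below is a minimal illustration under our own assumptions: a `(frames, joints, coords)` array layout and independent drop probabilities for whole frames and for individual joint observations (the names and probability values are hypothetical, not from the paper).

```python
import numpy as np

def corrupt_sequence(seq, frame_drop_p=0.3, joint_drop_p=0.1, rng=None):
    """Simulate temporal corruption on a skeleton sequence.

    seq: array of shape (T, J, C) -- T frames, J joints, C coordinates.
    Returns (corrupted, mask), where mask is 1 for observed entries and
    0 for dropped ones; dropped entries are zeroed out.
    """
    rng = rng or np.random.default_rng()
    T, J, _ = seq.shape
    mask = np.ones((T, J, 1), dtype=seq.dtype)
    # Drop whole frames (a pose estimator missing a detection entirely).
    mask[rng.random(T) < frame_drop_p] = 0.0
    # Drop individual joint observations (per-joint estimation failures).
    mask[rng.random((T, J)) < joint_drop_p] = 0.0
    return seq * mask, mask

# Example: a 100-frame, 25-joint, 3D sequence.
seq = np.random.default_rng(0).standard_normal((100, 25, 3))
corrupted, mask = corrupt_sequence(seq, rng=np.random.default_rng(1))
```

During training, the completion module sees `corrupted` as input and is supervised against the original `seq`, so varying the two drop probabilities yields the "wide range of corruption levels" the authors describe.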

The whole system is trained end‑to‑end, so the completion, decomposition, and dynamics modules co‑adapt to maximize classification performance.
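As a concrete illustration of steps 3 and 4, the sketch below splits joints into dynamic vs. static groups by motion variance and estimates per-joint accelerations. Both design choices here are our assumptions: the paper states only that the split is variance‑based (we use the median as the threshold), and we substitute second‑order central differences for its Lagrangian dynamics estimator, whose exact form is not given in this summary.

```python
import numpy as np

def split_dynamic_static(seq):
    """Partition joints by motion variance (assumed median threshold).

    seq: (T, J, C) skeleton sequence.
    Returns (dynamic_idx, static_idx) as index arrays.
    """
    disp = np.diff(seq, axis=0)          # (T-1, J, C) frame-to-frame motion
    var = disp.var(axis=(0, 2))          # one variance score per joint
    thresh = np.median(var)
    dynamic = np.where(var > thresh)[0]
    static = np.where(var <= thresh)[0]
    return dynamic, static

def estimate_acceleration(seq, dt=1.0):
    """Second-order central differences as a simple stand-in for the
    paper's physics-driven acceleration estimate."""
    acc = np.zeros_like(seq)
    acc[1:-1] = (seq[2:] - 2 * seq[1:-1] + seq[:-2]) / dt**2
    return acc
```

In the full pipeline, `split_dynamic_static` would drive which joints each auxiliary stream perturbs, and the acceleration sequence would be fed alongside positions into the GCN head.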

Results & Findings

| Dataset (corruption level) | Top‑1 Accuracy (FineTec) | Best prior | Gain |
| --- | --- | --- | --- |
| NTU‑60 (standard) | 96.4 % | 94.7 % | +1.7 % |
| NTU‑120 (standard) | 94.2 % | 92.5 % | +1.7 % |
| Gym99 (severe corruption) | 89.1 % | 81.3 % | +7.8 % |
| Gym288 (severe corruption) | 78.1 % | 70.4 % | +7.7 % |
  • FineTec’s advantage grows as corruption worsens, confirming that the completion + physics pipeline is especially effective when data loss is extreme.
  • Ablation studies show that removing any of the three pillars (completion, decomposition, acceleration) drops performance by 3–5 %, highlighting their complementary nature.
  • The model generalizes across coarse‑grained (NTU) and fine‑grained (Gym) tasks without task‑specific tuning, indicating a robust, reusable backbone for skeleton‑based perception.

Practical Implications

  • Robust real‑time analytics – Developers building surveillance, sports analytics, or AR/VR experiences can now rely on skeleton‑based action classifiers even when the upstream pose estimator drops frames (e.g., due to occlusion or low‑light conditions).
  • Edge deployment – The core components (a lightweight transformer for completion and a GCN) can be quantized and run on modern edge AI chips, enabling on‑device inference without sending raw video to the cloud.
  • Data‑efficient fine‑tuning – Because FineTec learns to fill gaps, it reduces the need for painstaking manual annotation of clean skeleton data; developers can train on noisy, in‑the‑wild captures and still achieve high accuracy.
  • Cross‑modal extensions – The physics‑driven acceleration stream can be fused with other modalities (e.g., audio or inertial sensors) to build multimodal activity‑recognition pipelines that are resilient to any single sensor’s failure.

Limitations & Future Work

  • Computation overhead – The temporal completion transformer and the Lagrangian estimator add latency compared with a vanilla GCN; real‑time constraints may require model pruning or distillation.
  • Assumption of skeletal topology – The decomposition relies on a fixed 25‑joint skeleton; adapting to alternative pose representations (e.g., dense mesh or hand‑only keypoints) would need redesign.
  • Limited exploration of extreme occlusions – While the paper simulates temporal masking, real‑world occlusions often produce correlated missing joints (e.g., an entire limb). Future work could incorporate spatial‑masking strategies and multimodal priors (RGB, depth) to further boost robustness.

FineTec opens the door to reliable fine‑grained action understanding even when the input is messy—a scenario that mirrors the noisy data pipelines most developers face today.

Authors

  • Dian Shao
  • Mingfei Shi
  • Like Liu

Paper Information

  • arXiv ID: 2512.25067v1
  • Categories: cs.CV
  • Published: December 31, 2025