[Paper] Pose-Based Sign Language Spotting via an End-to-End Encoder Architecture
Source: arXiv - 2512.08738v1
Overview
The paper introduces Sign Language Spotting, a new task that asks a system to determine whether a short query sign appears inside a longer, continuous sign‑language video. The authors present an efficient, end‑to‑end encoder that operates directly on body‑pose keypoints instead of raw video frames and answers this binary “present/absent” question with competitive accuracy.
Key Contributions
- Task definition – Formalizes sign language spotting as a retrieval problem distinct from full‑sentence gloss recognition.
- Pose‑only pipeline – Uses 2‑D/3‑D skeletal keypoints as the sole input, eliminating the need for expensive RGB processing and reducing visual noise (e.g., background, lighting).
- Encoder‑only architecture – A lightweight transformer‑style encoder coupled with a binary classification head, trained end‑to‑end without intermediate gloss or text supervision.
- Benchmark results – Achieves 61.88 % accuracy and 60.00 % F1 on the WSLP 2025 “Word Presence Prediction” dataset, establishing a strong baseline for future work.
- Open‑source release – Code and pretrained models are publicly available, encouraging reproducibility and community extensions.
Methodology
- Pose extraction – Each video frame is processed by an off‑the‑shelf pose estimator (e.g., OpenPose, MediaPipe) to obtain a sequence of keypoint vectors (joint coordinates + confidence scores); see the extraction sketch after this list.
- Temporal encoding – The keypoint sequences from the query and the target videos are concatenated and fed into a shared transformer encoder. Positional embeddings capture the order of frames, while self‑attention lets the model relate motion patterns across the two streams.
- Binary classification head – The encoder’s per‑frame hidden states are pooled (e.g., mean‑pooled over time) and passed through a small MLP that outputs a single sigmoid score indicating whether the query is present.
- Training – The model is trained with binary cross‑entropy loss on labeled pairs (positive = query appears, negative = it does not). No gloss annotations or language models are required.
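As a concrete illustration of the pose‑extraction step, the sketch below pulls per‑frame body keypoints with MediaPipe Holistic, one of the estimators the paper names as an example. The `extract_pose_sequence` helper, the 33‑landmark subset, and the zero‑filling for missed detections are illustrative assumptions, not the authors' exact preprocessing.

```python
# Pose-extraction sketch (assumption: MediaPipe Holistic body landmarks only).
import cv2
import numpy as np
import mediapipe as mp

def extract_pose_sequence(video_path: str) -> np.ndarray:
    """Return a (num_frames, 33 * 3) array of (x, y, visibility) values per frame."""
    holistic = mp.solutions.holistic.Holistic(static_image_mode=False)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input.
        results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks is not None:
            kp = [(lm.x, lm.y, lm.visibility) for lm in results.pose_landmarks.landmark]
        else:
            kp = [(0.0, 0.0, 0.0)] * 33  # zero-fill frames where no person is detected
        frames.append(np.asarray(kp, dtype=np.float32).reshape(-1))
    cap.release()
    holistic.close()
    return np.stack(frames) if frames else np.zeros((0, 33 * 3), dtype=np.float32)
```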
The entire pipeline runs on pose data alone, which dramatically cuts memory usage and inference latency compared with RGB‑based CNN‑RNN hybrids.
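A minimal PyTorch sketch of how the encoder‑only design described above could look: query and target keypoint sequences are concatenated along the time axis, tagged with learned positional and stream embeddings, passed through a transformer encoder, mean‑pooled, and scored by a small MLP trained with binary cross‑entropy. The layer counts, embedding sizes, and the stream‑embedding detail are assumptions for illustration, not the paper's exact hyperparameters.

```python
# Sketch of a pose-only spotting encoder (assumed sizes, not the authors' exact model).
import torch
import torch.nn as nn

class PoseSpotter(nn.Module):
    def __init__(self, kp_dim=99, d_model=256, n_heads=4, n_layers=4, max_len=512):
        super().__init__()
        self.proj = nn.Linear(kp_dim, d_model)      # keypoint vector -> model dimension
        self.pos = nn.Embedding(max_len, d_model)   # learned positional embeddings
        self.seg = nn.Embedding(2, d_model)         # 0 = query stream, 1 = target stream
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )

    def forward(self, query, target):
        # query: (B, Tq, kp_dim), target: (B, Tt, kp_dim)
        x = torch.cat([query, target], dim=1)  # concatenate along time
        seg_ids = torch.cat([
            torch.zeros(query.size(1), dtype=torch.long, device=x.device),
            torch.ones(target.size(1), dtype=torch.long, device=x.device),
        ])
        pos_ids = torch.arange(x.size(1), device=x.device)
        h = self.proj(x) + self.pos(pos_ids) + self.seg(seg_ids)
        h = self.encoder(h)             # joint self-attention over both streams
        pooled = h.mean(dim=1)          # mean-pool over time
        return self.head(pooled).squeeze(-1)  # logit: "query present?"

# One training step with binary cross-entropy on labeled (query, target, label) pairs.
model = PoseSpotter()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

query = torch.randn(8, 32, 99)    # 8 query clips, 32 frames, 33 keypoints x 3 values
target = torch.randn(8, 128, 99)  # 8 target clips, 128 frames
labels = torch.randint(0, 2, (8,)).float()

optimizer.zero_grad()
loss = loss_fn(model(query, target), labels)
loss.backward()
optimizer.step()
```

The `kp_dim=99` default matches the 33 × 3 values per frame produced by the extraction sketch above.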
Results & Findings
| Metric | Value |
|---|---|
| Accuracy | 61.88 % |
| F1‑score | 60.00 % |
| Model size | ~12 M parameters (≈ 45 MB) |
| Inference speed | ~120 fps on a single RTX 3080 (pose input) |
- The pose‑only model outperforms a baseline RGB‑based 3‑D CNN that was trained on the same task, confirming that skeletal motion carries the most discriminative information for spotting.
- Ablation studies show that removing self‑attention or using only the query (no target context) drops performance by > 10 %, highlighting the importance of joint temporal modeling.
- The system is robust to variations in signer appearance and background, thanks to the abstraction provided by pose keypoints.
Practical Implications
- Real‑time sign retrieval – Developers can embed the model in video‑search tools for sign‑language archives, enabling instant lookup of specific signs without manual annotation (see the inference sketch after this list).
- Assistive interfaces – Mobile or web apps could alert deaf users when a particular sign (e.g., a warning or a brand name) appears in live video streams, enhancing accessibility.
- Low‑resource deployment – Since only pose data is needed, the solution can run on edge devices (smartphones, AR glasses) with modest compute budgets, opening doors to on‑device sign‑language verification.
- Data annotation aid – Automatic spotting can pre‑filter long recordings, allowing human annotators to focus on confirming or correcting detections, accelerating dataset creation for downstream automatic sign language recognition (ASLR) tasks.
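To make the retrieval and on‑device scenarios above concrete, the sketch below slides a trained `PoseSpotter` (from the methodology sketch) over a long pose sequence and flags windows whose sigmoid score crosses a threshold. The window length, stride, and threshold are hypothetical choices; the paper itself only reports video‑level present/absent decisions.

```python
# Hypothetical sliding-window use of a trained spotter over a long recording.
import torch

@torch.no_grad()
def spot_in_stream(model, query, stream, window=128, stride=32, threshold=0.5):
    """Return (start_frame, probability) pairs where the query sign is predicted present.

    query:  (Tq, kp_dim) pose sequence of the sign being searched for
    stream: (Ts, kp_dim) pose sequence of the long target recording
    """
    model.eval()
    hits = []
    for start in range(0, max(stream.size(0) - window + 1, 1), stride):
        segment = stream[start:start + window].unsqueeze(0)  # (1, <=window, kp_dim)
        prob = torch.sigmoid(model(query.unsqueeze(0), segment)).item()
        if prob >= threshold:
            hits.append((start, prob))
    return hits
```

Because only pose tensors cross the model boundary, the same loop could run on an edge device; detections could then be surfaced to annotators or end users as described in the bullets above.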
Limitations & Future Work
- Pose quality dependency – The model’s accuracy hinges on reliable keypoint detection; occlusions, extreme camera angles, or low‑resolution footage can degrade performance.
- Binary scope – The current formulation only answers “present/absent”. Extending to multi‑class spotting (identifying which sign) or handling overlapping signs remains open.
- Temporal granularity – Spotting is performed at the video‑level; finer localization (exact start/end frames) is not addressed.
- Dataset size – The WSLP 2025 benchmark is relatively small; larger, more diverse corpora are needed to assess generalization across sign languages and signing styles.
Future research directions include integrating pose‑plus‑hand‑shape cues, leveraging self‑supervised pretraining on massive unlabeled sign videos, and exploring hierarchical models that jointly perform spotting and full‑sentence translation.
Authors
- Samuel Ebimobowei Johnny
- Blessed Guda
- Emmanuel Enejo Aaron
- Assane Gueye
Paper Information
- arXiv ID: 2512.08738v1
- Categories: cs.CV, cs.CL
- Published: December 9, 2025