From Pixels to Predictions: Data Pipelines and Training the Sequence Model (Part 2)

Published: April 17, 2026 at 07:21 PM EDT
4 min read
Source: Dev.to

Introduction

In Part 1 of this series we introduced the architecture of the ASL‑to‑voice translation system—a five‑stage pipeline that turns real‑time webcam video into spoken English. A machine‑learning model is only as good as the data it learns from, and raw video is often too noisy, heavy, and unstructured to be useful directly. This article dives into the data layer: extracting meaningful signals from raw video, normalizing them for robust inference, and training the temporal sequence model.

Datasets

The project supports several public datasets:

  • WLASL (Word‑Level American Sign Language) – over 2 000 signs performed by more than 100 signers. Used as the primary baseline, often starting with a top‑50 sign subset for rapid iteration.
  • RWTH‑PHOENIX‑2014T – continuous German Sign Language with rich gloss annotations.
  • How2Sign – a large‑scale, continuous ASL dataset.

Custom scripts (e.g., scripts/download_wlasl.py) automatically scrape, organize, and format these datasets for the extraction phase.

From Pixels to Keypoints

Passing raw RGB frames directly into a temporal model (e.g., a 3D CNN or Vision Transformer) requires massive computational power and high‑end GPUs. Because the goal is real‑time inference on consumer hardware, we use skeletonization.

MediaPipe Holistic

Google’s MediaPipe Holistic framework processes video frame‑by‑frame, extracting 3D coordinates (x, y, z) of specific landmarks on the human body.

Feature Vector Construction

In models/keypoint_extractor.py we build a dense feature vector for every frame:

  • Hands: 21 landmarks per hand × 3 dimensions = 126 dimensions.
  • Pose (Body): 33 landmarks × 4 dimensions (including visibility) = 132 dimensions.
  • Face: Full face mesh = 468 points (1 404 dimensions). This is often overkill; a configuration toggle can extract only the mouth subset (~20 landmarks ≈ 60 dimensions), which is critical for non‑manual markers in ASL.

By default, millions of pixels are compressed into a 1 662‑dimensional vector per frame (including the full face mesh).
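As a sketch of what models/keypoint_extractor.py produces (the function name below is illustrative, not necessarily the project's actual API), flattening the per-frame landmark arrays yields the 1 662-dimensional vector:

```python
import numpy as np

def build_feature_vector(left_hand, right_hand, pose, face):
    """Flatten one frame's landmarks into a single dense vector.

    left_hand, right_hand: (21, 3) arrays of x, y, z (zeros if undetected)
    pose: (33, 4) array of x, y, z, visibility
    face: (468, 3) array from the full face mesh
    """
    parts = [left_hand, right_hand, pose, face]
    return np.concatenate([p.reshape(-1) for p in parts])

# Dimension check: 2 * 63 + 132 + 1404 = 1662
vec = build_feature_vector(
    np.zeros((21, 3)), np.zeros((21, 3)),
    np.zeros((33, 4)), np.zeros((468, 3)),
)
print(vec.shape)  # (1662,)
```

When a hand leaves the frame, MediaPipe returns no landmarks for it, so zero-filling keeps the vector length constant across frames.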

Normalization

If the model trains on a person standing in the center of the frame, it will fail when the user stands in the bottom‑left corner. To make the data translation‑invariant, we implement shoulder‑based normalization:

  1. Compute the midpoint between the left and right shoulder landmarks (Pose points 11 and 12).
  2. Translate all other keypoints so that this shoulder midpoint becomes the origin (0, 0, 0).

The model then only cares about how the hands and face move relative to the body, not where the body is in the camera frame.
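The two steps above can be sketched as follows (a minimal version for illustration; the function name is hypothetical):

```python
import numpy as np

LEFT_SHOULDER, RIGHT_SHOULDER = 11, 12  # MediaPipe pose landmark indices

def normalize_frame(pose_xyz, *other_keypoint_sets):
    """Translate all keypoints so the shoulder midpoint becomes the origin.

    pose_xyz: (33, 3) pose landmark coordinates
    other_keypoint_sets: any number of (N, 3) arrays (hands, face)
    Returns the translated arrays in the same order.
    """
    origin = (pose_xyz[LEFT_SHOULDER] + pose_xyz[RIGHT_SHOULDER]) / 2.0
    return [pose_xyz - origin] + [k - origin for k in other_keypoint_sets]
```

After this transform, the shoulder midpoint sits exactly at (0, 0, 0) in every frame, regardless of where the signer stands in the camera view.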

Model Architecture

With videos converted into sequences of normalized 1 662‑dimensional vectors, we train a Transformer Encoder defined in models/sequence_model.py.

Why a Transformer?

Recurrent Neural Networks (e.g., our BiLSTM baseline) handle sequence data, but Transformers excel at modeling long‑range dependencies and parallelize efficiently on modern hardware.

Default Configuration (config.yaml)

  • Input Projection: Linear layer scaling the 1 662‑dim input to the model’s hidden dimension (e.g., 256).
  • Positional Encoding: Standard sinusoidal encodings injected so self‑attention understands temporal order.
  • Encoder Blocks: 6 layers of multi‑head self‑attention (8 heads) to capture the full context of the sign.
  • CTC Head: Final linear layer projecting the hidden state to the vocabulary size, followed by a log‑softmax activation.
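A minimal PyTorch sketch of this architecture, assuming the defaults above (the class name, vocabulary size of 51 — a top-50 sign subset plus the CTC blank — and 512-frame maximum are illustrative choices, not taken from the project):

```python
import math
import torch
import torch.nn as nn

class SignTransformer(nn.Module):
    """Sketch of the encoder described above: input projection,
    sinusoidal positional encoding, 6 encoder layers, CTC head."""

    def __init__(self, input_dim=1662, d_model=256, nhead=8,
                 num_layers=6, vocab_size=51):
        super().__init__()
        self.proj = nn.Linear(input_dim, d_model)
        # Fixed sinusoidal positional encodings (up to 512 frames)
        pe = torch.zeros(512, d_model)
        pos = torch.arange(512).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):                 # x: (batch, frames, 1662)
        h = self.proj(x) + self.pe[: x.size(1)]
        h = self.encoder(h)
        return self.head(h).log_softmax(dim=-1)  # (batch, frames, vocab)
```

Note the log-softmax at the end: PyTorch's CTC loss expects log-probabilities rather than raw logits.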

Training

Continuous sign language lacks explicit boundaries for each sign; we only know that a video contains certain glosses (e.g., ["HELLO", "WORLD"]). To handle this alignment problem we use Connectionist Temporal Classification (CTC) loss. CTC introduces a special "blank" token, allowing the model to emit blanks during transitions and spike the probability of a specific sign only when it is recognized.

The training script training/train_sequence.py employs:

import torch
import torch.nn as nn
import torch.optim as optim

criterion = nn.CTCLoss(zero_infinity=True)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# Reduce the learning rate when validation loss plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min')

# Inside the training loop, after loss.backward() and before optimizer.step():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
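For context on how nn.CTCLoss is called, here is a self-contained shape check with random tensors (the dimensions are illustrative): CTCLoss expects time-major (T, N, C) log-probabilities, so a batch-first model output must be transposed before the loss is computed.

```python
import torch
import torch.nn as nn

criterion = nn.CTCLoss(zero_infinity=True)  # blank token at index 0

# T frames, N videos in the batch, C classes (vocabulary + blank)
T, N, C = 60, 4, 51
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, 5))            # gloss IDs, blank excluded
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 5, dtype=torch.long)

loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow; clip before optimizer.step() in the loop
```

Per-sample length tensors matter because real batches mix videos of different durations and gloss sequences of different lengths; padding is masked out via input_lengths and target_lengths.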

Evaluation

Standard loss metrics are insufficient. We evaluate models using Word Error Rate (WER) via the jiwer library. WER counts insertions, deletions, and substitutions needed to turn the predicted gloss sequence into the ground‑truth sequence; lower WER indicates better performance.
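In practice jiwer handles this computation; as an illustration, WER is a word-level Levenshtein distance normalized by the reference length. A minimal pure-Python equivalent:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / ref words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("HELLO MY NAME WORLD", "HELLO NAME WORLD"))  # 0.25 (one deletion)
```

A WER of 0.0 means a perfect transcription; values above 1.0 are possible when the hypothesis contains many insertions.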

Next Steps

We now have a trained Transformer model capable of converting a sequence of keypoints into a sequence of gloss probabilities. In Part 3 we will explore the real‑time inference loop, sliding‑window processing, and how to translate robotic glosses into natural spoken English.
