[Paper] Bangla Sign Language Translation: Dataset Creation Challenges, Benchmarking and Prospects

Published: November 26, 2025 at 11:00 AM EST
3 min read
Source: arXiv - 2511.21533v1

Overview

Bangla Sign Language Translation (BdSLT) has long suffered from a lack of data, making it hard to build reliable AI assistants for the Bangla‑speaking deaf community. This paper introduces IsharaKhobor, the first sizable, publicly released Bangla sign‑language dataset, and shows how different preprocessing tricks affect translation performance.

Key Contributions

  • IsharaKhobor dataset: ~5 k video clips of Bangla sign sentences with aligned textual translations, released on Kaggle.
  • Two curated subsets:
    • IsharaKhobor_small: a vocabulary‑restricted version for low‑resource experiments.
    • IsharaKhobor_canonical_small: same as above but with canonicalized (standardized) glosses.
  • Dataset creation pipeline: a detailed discussion of the annotation workflow, quality control, and the linguistic challenges unique to Bangla sign language.
  • Benchmarking suite: baseline models built on raw landmark features extracted from the videos and on a recent RQE (Relation‑Query‑Embedding) approach, plus ablation studies on vocabulary size and canonicalization.
  • Open‑source release: data, preprocessing scripts, and evaluation code are all publicly available, encouraging reproducibility and community contributions.

Methodology

  1. Data collection – Native BdSL signers recorded short sentences (5‑15 s) covering everyday topics. Each video was captured with a single RGB camera under consistent lighting.
  2. Annotation – Professional Bangla linguists transcribed the signed content into textual sentences and also generated glosses (a word‑by‑word representation of the signs).
  3. Pre‑processing
    • Landmark extraction: OpenPose was used to extract 2‑D hand, body, and facial keypoints (≈ 150 points per frame); a minimal extraction sketch follows this list.
    • RQE embedding: A transformer‑based encoder that learns relational queries over the spatio‑temporal landmark sequence.
    • Vocabulary restriction: Only the most frequent 1 k glosses were kept for the “small” subsets.
    • Canonicalization: Glosses were normalized (e.g., merging synonyms, fixing spelling) to reduce noise; the second sketch after this list illustrates both vocabulary restriction and canonicalization.
  4. Modeling – A sequence‑to‑sequence architecture (Encoder‑Decoder with attention) was trained on the raw landmarks and on the RQE embeddings. Standard metrics (BLEU, ROUGE, METEOR) measured translation quality.
  5. Ablation – Experiments compared: full vs. small vocabularies, raw vs. canonicalized glosses, and landmark vs. RQE features.
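
To make step 3 concrete, the first sketch below extracts per‑frame 2‑D keypoints from a video. Note the hedge: the paper uses OpenPose, but MediaPipe Holistic is substituted here purely because it installs with pip, so the keypoint counts differ from the paper's ≈ 150 points and this is not the authors' pipeline.

```python
# Minimal sketch of per-frame landmark extraction for one sign video.
# The paper's pipeline uses OpenPose; MediaPipe Holistic is used here
# only as a readily installable stand-in (pip install mediapipe
# opencv-python), so keypoint counts differ from the paper's ~150.
import cv2
import numpy as np
import mediapipe as mp

def extract_landmarks(video_path: str) -> np.ndarray:
    """Return 2-D keypoints with shape (num_frames, 75, 2)."""
    holistic = mp.solutions.holistic.Holistic(static_image_mode=False)
    cap = cv2.VideoCapture(video_path)
    all_frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        points = []
        # 33 body points plus 21 points per hand; facial landmarks
        # (results.face_landmarks) are omitted here for brevity.
        for lms, n in ((results.pose_landmarks, 33),
                       (results.left_hand_landmarks, 21),
                       (results.right_hand_landmarks, 21)):
            if lms is None:
                points.extend([(0.0, 0.0)] * n)  # pad missed detections
            else:
                points.extend((lm.x, lm.y) for lm in lms.landmark)
        all_frames.append(points)
    cap.release()
    holistic.close()
    return np.asarray(all_frames, dtype=np.float32)
```

The gloss‑side steps are simpler. The second sketch applies the 1 k‑gloss frequency cutoff used for the small subsets, plus a canonicalization pass; the synonym map is a hypothetical placeholder for the paper's actual normalization rules.

```python
# Sketch of the two gloss-side reductions from step 3. The 1,000-gloss
# frequency cutoff matches the paper's "small" subsets; the synonym map
# is a hypothetical stand-in for the paper's canonicalization rules.
from collections import Counter

def canonicalize(gloss: str, synonym_map: dict[str, str]) -> str:
    gloss = gloss.strip().lower()          # normalize trivial spelling variation
    return synonym_map.get(gloss, gloss)   # merge synonyms into one canonical form

def restrict_vocabulary(sentences: list[list[str]], top_k: int = 1000) -> list[list[str]]:
    counts = Counter(g for sent in sentences for g in sent)
    keep = {g for g, _ in counts.most_common(top_k)}
    # Out-of-vocabulary glosses are mapped to a single <unk> token.
    return [[g if g in keep else "<unk>" for g in sent] for sent in sentences]
```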

Results & Findings

| Experiment | BLEU ↑ | ROUGE‑L ↑ | METEOR ↑ |
| --- | --- | --- | --- |
| Full dataset (landmarks) | 21.4 | 38.7 | 19.2 |
| Full dataset (RQE) | 24.1 | 41.2 | 22.0 |
| Small vocab (landmarks) | 18.9 | 35.4 | 17.5 |
| Small vocab (canonical) | 20.6 | 37.1 | 19.0 |

  • RQE embeddings consistently outperformed raw landmark features, indicating that relational modeling captures sign dynamics better than unprocessed keypoints.
  • Canonicalization gave a modest boost (≈ 1.5 BLEU) by reducing gloss ambiguity; a short sketch of how such scores are computed follows this list.
  • Vocabulary restriction lowered performance, but the gap narrowed when combined with canonicalization, suggesting a viable path for ultra‑low‑resource scenarios.
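
As a reference point for the numbers above, here is a minimal sketch of computing corpus‑level BLEU and ROUGE‑L with the sacrebleu and rouge‑score packages; the example sentences are placeholders, not dataset content, and the paper does not specify which metric implementations it used.

```python
# Minimal sketch of scoring translations with corpus-level BLEU and
# ROUGE-L, as in the table above. Uses the sacrebleu and rouge-score
# packages; the example sentences are placeholders, not dataset content.
import sacrebleu
from rouge_score import rouge_scorer

hypotheses = ["the market opens in the morning"]
references = ["the market opens every morning"]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(references[0], hypotheses[0])["rougeL"].fmeasure
print(f"ROUGE-L: {100 * rouge_l:.1f}")
```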

Practical Implications

  • Assistive Apps – Developers can now prototype real‑time BdSL‑to‑text translators using the released dataset and baseline code, accelerating the creation of mobile or web‑based communication tools for Bangla‑speaking deaf users; a skeleton of such a loop is sketched after this list.
  • Transfer Learning – The RQE encoder can be fine‑tuned on other sign languages, offering a reusable component for multilingual sign‑language research.
  • Curriculum Design – Educators can use the curated subsets to teach machine‑learning concepts (e.g., data cleaning, low‑resource NLP) with a culturally relevant example.
  • Standardization Efforts – The canonical glosses provide a starting point for building a Bangla Sign Language lexicon, which could feed into government‑backed accessibility standards.
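
To make the assistive‑app point concrete, here is a hypothetical skeleton of a real‑time capture‑and‑translate loop. Both `extract_landmarks_from_frame` and `model.translate` are placeholders for the released preprocessing and baseline code, which are not reproduced here; only the frame‑buffering pattern is illustrated.

```python
# Hypothetical skeleton for a real-time BdSL-to-text prototype.
# `extract_landmarks_from_frame` and `model.translate` are placeholders
# for the paper's released preprocessing and baseline model; this only
# illustrates the buffering pattern, not the actual released API.
import cv2

WINDOW = 60  # ~2 s of frames at 30 fps; tune for 5-15 s signed sentences

def run(model, extract_landmarks_from_frame):
    cap = cv2.VideoCapture(0)  # default webcam
    buffer = []
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            buffer.append(extract_landmarks_from_frame(frame))
            if len(buffer) >= WINDOW:
                print(model.translate(buffer))  # emit one Bangla sentence
                buffer.clear()
    finally:
        cap.release()
```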

Limitations & Future Work

  • Scale – At ~5 k clips, IsharaKhobor is still modest compared to large‑scale sign‑language corpora; more diverse signers, environments, and sentence structures are needed.
  • Modalities – Only RGB video was captured; depth or motion‑capture data could improve hand‑shape discrimination.
  • Evaluation – BLEU‑style metrics may not fully reflect sign‑language nuances; human‑in‑the‑loop assessments are planned.
  • Modeling – The study focused on landmark‑based pipelines; future work could explore end‑to‑end video transformers or multimodal fusion with audio (for hearing‑impaired users who lip‑read).

Authors

  • Husne Ara Rubaiyeat
  • Hasan Mahmud
  • Md Kamrul Hasan

Paper Information

  • arXiv ID: 2511.21533v1
  • Categories: cs.CL, cs.CV
  • Published: November 26, 2025
  • PDF: https://arxiv.org/pdf/2511.21533v1