[Paper] NCTB-QA: A Large-Scale Bangla Educational Question Answering Dataset and Benchmarking Performance
Source: arXiv - 2603.05462v1
Overview
NCTB‑QA is a new large‑scale, high‑quality Bangla question‑answering dataset. Drawn from 50 national curriculum textbooks, it contains roughly 88,000 QA pairs with a deliberately balanced mix of answerable and unanswerable questions, a balance most existing Bangla QA resources lack. The paper shows that fine‑tuning modern transformer models on this domain‑specific data dramatically lifts performance, pointing to a practical path for building reliable reading‑comprehension systems in low‑resource languages.
Key Contributions
- Large, balanced Bangla QA corpus – 87,805 QA pairs with 57 % answerable and 43 % unanswerable questions, sourced from official Bangla textbooks.
- Adversarial distractors – many unanswerable items include plausible but incorrect answer candidates, forcing models to truly understand the context.
- Comprehensive benchmark – evaluation of three popular transformer architectures (BERT, RoBERTa, ELECTRA) on the new dataset, with detailed metrics (Exact Match, F1, BERTScore).
- Domain‑specific fine‑tuning recipe – demonstrates that a relatively modest amount of textbook data can yield a 313 % relative F1 gain for BERT (0.150 → 0.620).
- Open‑access release – dataset, preprocessing scripts, and trained checkpoints are publicly available, encouraging further work on Bangla NLP and low‑resource QA.
Methodology
- Data collection & cleaning – The authors harvested question‑answer pairs from PDFs of 50 textbooks published by Bangladesh’s National Curriculum and Textbook Board (NCTB). Automated OCR was followed by manual verification to ensure correct alignment of questions, answer spans, and source passages.
- Answerability labeling – Each question was tagged as answerable (the answer text appears verbatim in the passage) or unanswerable. For the latter, the team inserted “plausible distractors” that look correct but are not supported by the context.
- Dataset split – Standard train/validation/test splits (≈ 80/10/10 %) preserve the answerability ratio across splits.
- Model fine‑tuning – Pre‑trained Bangla BERT, RoBERTa, and ELECTRA models were further trained on NCTB‑QA using the typical span‑prediction head (start/end token logits) plus a binary classifier for answerability. Hyper‑parameters (learning rate, batch size, epochs) were tuned on the validation set.
- Evaluation – Traditional SQuAD‑style metrics (Exact Match, F1) assess span accuracy, while BERTScore measures semantic similarity between predicted and gold answers, providing a more forgiving view of answer quality.
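The stratified split described above can be sketched in a few lines of dependency‑free Python. The 80/10/10 proportions and the answerable/unanswerable labeling come from the paper; the function name, the `answerable` field, and the grouping logic are illustrative assumptions, not the authors' preprocessing code:

```python
import random

def stratified_split(examples, ratios=(0.8, 0.1, 0.1), seed=13):
    """Split QA examples into train/val/test while preserving the
    answerable/unanswerable ratio within each split."""
    # Group examples by their answerability label.
    by_label = {True: [], False: []}
    for ex in examples:
        by_label[ex["answerable"]].append(ex)

    splits = {"train": [], "val": [], "test": []}
    rng = random.Random(seed)
    # Slice each label group separately so every split keeps the same ratio.
    for group in by_label.values():
        rng.shuffle(group)
        n = len(group)
        n_train = int(ratios[0] * n)
        n_val = int(ratios[1] * n)
        splits["train"] += group[:n_train]
        splits["val"] += group[n_train:n_train + n_val]
        splits["test"] += group[n_train + n_val:]
    return splits
```

On a toy corpus with the paper's 57/43 mix, each split retains that same proportion, which is the property the authors report preserving across their splits.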
Results & Findings
| Model | EM (↑) | F1 (↑) | BERTScore (↑) |
|---|---|---|---|
| BERT (baseline, no fine‑tune) | 0.12 | 0.15 | 0.42 |
| BERT (fine‑tuned on NCTB‑QA) | 0.48 | 0.62 | 0.71 |
| RoBERTa (fine‑tuned) | 0.44 | 0.58 | 0.68 |
| ELECTRA (fine‑tuned) | 0.41 | 0.55 | 0.66 |
- Huge relative gain: BERT’s F1 jumps 313 % after domain‑specific fine‑tuning.
- Answerability handling: All models learn to output “no answer” for unanswerable items (43 % of the dataset), dramatically reducing false positives.
- Semantic quality: BERTScore improvements indicate that even when exact span matches fail, predicted answers remain semantically close to the gold answers.
- Difficulty: Despite the gains, absolute scores remain modest compared to high‑resource English QA benchmarks, confirming that NCTB‑QA is a challenging testbed.
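For readers unfamiliar with the metrics, Exact Match and token‑level F1 can be computed as below. This is a generic re‑implementation of the standard SQuAD‑style definitions, not the authors' evaluation script, and it also reproduces the 313 % relative‑gain arithmetic quoted above:

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    """1.0 iff the predicted string equals the gold string exactly."""
    return float(pred.strip() == gold.strip())

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_toks, gold_toks = pred.split(), gold.split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def relative_gain(before: float, after: float) -> float:
    """Relative improvement in percent."""
    return (after - before) / before * 100

# BERT's F1 rises from 0.150 to 0.620, i.e. a ~313 % relative gain.
print(round(relative_gain(0.150, 0.620)))  # 313
```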
Practical Implications
- Educational tech – Platforms that provide automated tutoring or exam preparation for Bangla‑speaking students can now train models that recognize when a passage doesn’t contain the answer, avoiding misleading hints.
- Low‑resource NLP pipelines – The study demonstrates a repeatable recipe: gather domain‑specific, balanced data → fine‑tune a multilingual or language‑specific transformer → achieve outsized performance lifts without massive compute.
- Cross‑lingual transfer – Developers can use NCTB‑QA as a downstream task to evaluate how well English‑trained models adapt to Bangla, informing decisions about multilingual model selection.
- Robust QA services – By incorporating answerability detection, production systems can return “I don’t know” rather than fabricating answers, improving user trust—crucial for chatbots, voice assistants, and search interfaces in Bangla.
- Open resources – The released checkpoints let teams skip the heavy fine‑tuning step and directly integrate a Bangla‑aware QA head into existing pipelines (e.g., Hugging Face Transformers).
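One common way such a service abstains follows the SQuAD 2.0 decoding convention: compare the best answer‑span score against the model's no‑answer score (taken from the `[CLS]` position). A minimal sketch, assuming raw start/end logits are already available; the function name, threshold value, and toy inputs are illustrative, not taken from the paper:

```python
def decode_answer(start_logits, end_logits, tokens,
                  null_threshold=0.0, max_answer_len=30):
    """Return the best answer span, or None ("I don't know") when the
    no-answer score (position 0, by SQuAD 2.0 convention) wins."""
    null_score = start_logits[0] + end_logits[0]
    best_score, best_span = float("-inf"), None
    # Search all valid spans up to max_answer_len tokens.
    for s in range(1, len(tokens)):
        for e in range(s, min(s + max_answer_len, len(tokens))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_score, best_span = score, (s, e)
    if best_span is None or null_score - best_score > null_threshold:
        return None  # abstain rather than fabricate an answer
    return " ".join(tokens[best_span[0]:best_span[1] + 1])
```

Raising `null_threshold` makes the system more conservative, trading a few missed answers for fewer fabricated ones, exactly the trade‑off that matters for user trust.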
Limitations & Future Work
- Domain confinement – All passages come from school textbooks, so models may struggle with informal or domain‑specific Bangla (news, social media).
- Answer span granularity – Some gold answers are short phrases; others are longer sentences, which can bias span‑prediction metrics.
- Unanswerable design – While distractors are plausible, they are still handcrafted; real‑world queries may contain subtler ambiguities.
- Future directions suggested by the authors include expanding the corpus to cover higher‑education and non‑educational texts, exploring multilingual pre‑training to boost low‑resource performance, and investigating retrieval‑augmented QA architectures that can scale beyond a single passage.
Authors
- Abrar Eyasir
- Tahsin Ahmed
- Muhammad Ibrahim
Paper Information
- arXiv ID: 2603.05462v1
- Categories: cs.CL
- Published: March 5, 2026