[Paper] Bangla MedER: Multi-BERT Ensemble Approach for the Recognition of Bangla Medical Entity

Published: December 19, 2025 at 11:41 AM EST
3 min read
Source: arXiv - 2512.17769v1

Overview

The paper introduces Bangla MedER, a new benchmark for recognizing medical entities in Bangla text, and proposes a Multi‑BERT Ensemble model that pushes accuracy close to 90 %. By tackling the scarcity of annotated Bangla medical data, the work opens the door for NLP‑driven healthcare tools in a language that has been largely ignored by the research community.

Key Contributions

  • Bangla MedER dataset: a manually curated, high‑quality corpus of Bangla medical sentences with entity annotations (e.g., diseases, drugs, procedures).
  • Comprehensive baseline study: evaluation of several transformer families (BERT, DistilBERT, ELECTRA, RoBERTa) on the new dataset.
  • Multi‑BERT Ensemble architecture: combines predictions from multiple fine‑tuned BERT models using a voting/stacking scheme, achieving 89.58 % accuracy, an 11.80‑percentage‑point gain over the single BERT‑base baseline.
  • Extensive evaluation: reports precision, recall, F1‑score, and confusion matrices across entity types, demonstrating robustness.
  • Open‑source release: code, trained models, and the dataset are made publicly available to encourage reproducibility and further research.

Methodology

  1. Data Collection & Annotation

    • Gathered Bangla medical texts from public health portals, research articles, and clinical notes.
    • Professional annotators labeled entities such as Disease, Medication, Symptom, and Procedure following a predefined schema.
  2. Model Fine‑Tuning

    • Each transformer (BERT‑base, DistilBERT, ELECTRA‑small, RoBERTa‑base) was fine‑tuned on the Bangla MedER training split using a token‑level classification head (softmax over entity tags); a minimal fine‑tuning sketch follows this list.
  3. Ensemble Construction

    • After individual fine‑tuning, the models’ logits for each token were aggregated.
    • Two strategies were explored:
      • Majority voting (hard ensemble) – the most common tag among models wins.
      • Stacked meta‑learner (soft ensemble) – a lightweight feed‑forward network learns to weight each model’s confidence scores.
    • The stacked approach yielded the best performance and is referred to as the Multi‑BERT Ensemble; an aggregation sketch follows this list.
  4. Evaluation

    • Standard NER metrics (precision, recall, F1) were computed per entity class and overall; a scoring sketch follows this list.
    • A held‑out test set ensured that the ensemble’s gains were not due to overfitting.
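
For readers who want to reproduce step 2, the snippet below is a minimal fine‑tuning sketch using the Hugging Face Transformers API. The BIO‑style label list and the multilingual checkpoint are illustrative assumptions; this summary does not give the paper's exact tag inventory or checkpoint names.

```python
# A minimal sketch of the per-model fine-tuning setup in step 2, using the
# Hugging Face Transformers API. The BIO-style label list and the multilingual
# checkpoint are assumptions for illustration, not taken from the paper.
from transformers import AutoModelForTokenClassification, AutoTokenizer

LABELS = ["O",
          "B-Disease", "I-Disease",
          "B-Medication", "I-Medication",
          "B-Symptom", "I-Symptom",
          "B-Procedure", "I-Procedure"]  # assumed BIO encoding of the four entity types

checkpoint = "bert-base-multilingual-cased"  # stand-in for any of the four backbones
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)
# Fine-tuning then follows the standard token-classification recipe:
# align word-level tags to subword tokens and train with Trainer.
```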
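
The aggregation in step 3 can be illustrated in plain NumPy, assuming each of the K fine‑tuned models has emitted per‑token logits. Names and shapes here are illustrative, not the paper's released code; the stacked meta‑learner is reduced to a learned linear mixture for brevity.

```python
# A plain-NumPy sketch of the two aggregation strategies in step 3, assuming
# each of the K fine-tuned models has produced per-token logits of shape
# (seq_len, n_tags). Illustrative only; the paper's stacked meta-learner is
# reduced here to a per-model weight vector.
import numpy as np

def hard_vote(per_model_logits):
    """Majority voting: each model casts one tag vote per token."""
    votes = np.stack([lg.argmax(axis=-1) for lg in per_model_logits])  # (K, seq_len)
    n_tags = per_model_logits[0].shape[-1]
    counts = np.apply_along_axis(
        lambda col: np.bincount(col, minlength=n_tags), 0, votes)      # (n_tags, seq_len)
    return counts.argmax(axis=0)                                       # (seq_len,)

def soft_vote(per_model_logits, weights):
    """Soft ensemble: a per-model weight (in the paper, learned by a small
    feed-forward meta-learner) mixes the models' confidence scores."""
    stacked = np.stack(per_model_logits)               # (K, seq_len, n_tags)
    mixed = np.tensordot(weights, stacked, axes=1)     # (seq_len, n_tags)
    return mixed.argmax(axis=-1)

# e.g. tags = soft_vote([m1, m2, m3, m4], weights=np.array([0.2, 0.2, 0.25, 0.35]))
```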
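
For the entity‑level scoring in step 4, a common choice (the paper does not name its tooling) is the seqeval library, which reports per‑class precision, recall, and F1 over BIO‑tagged sequences:

```python
# Entity-level NER scoring as in step 4, using seqeval (an assumed but
# standard choice of library for this task).
from seqeval.metrics import classification_report

y_true = [["O", "B-Disease", "I-Disease", "B-Medication"]]
y_pred = [["O", "B-Disease", "O", "B-Medication"]]
print(classification_report(y_true, y_pred))  # per-class precision/recall/F1
```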

Results & Findings

  Model                       Accuracy   Macro‑F1
  BERT‑base (single layer)    77.78 %    0.73
  DistilBERT                  80.12 %    0.75
  ELECTRA‑small               81.45 %    0.77
  RoBERTa‑base                82.30 %    0.78
  Multi‑BERT Ensemble         89.58 %    0.86
  • The ensemble outperformed the strongest single model (RoBERTa‑base) by 7.28 percentage points in accuracy and 0.08 in macro‑F1.
  • Gains were especially pronounced for low‑frequency entities (e.g., Procedure), where the ensemble mitigated individual model biases.
  • Error analysis showed that most remaining mistakes stem from ambiguous phrasing and domain‑specific abbreviations not seen during training.

Practical Implications

  • Clinical Decision Support: Automated extraction of diseases, medications, and procedures from Bangla electronic health records (EHRs) can feed downstream triage or alert systems.
  • Health‑Chatbots & Virtual Assistants: Accurate entity recognition enables Bangla‑speaking chatbots to understand patient queries, retrieve relevant medical knowledge, and suggest next steps.
  • Pharmacovigilance & Public Health Surveillance: Mining Bangla social media or news for drug‑related mentions becomes feasible, supporting early detection of adverse events.
  • Cross‑Lingual Transfer: The ensemble framework can be adapted to other low‑resource medical languages by swapping in language‑specific pretrained transformers.
  • Open‑source Toolkit: Developers can plug the released models into popular NLP libraries (Hugging Face Transformers) with minimal code changes, accelerating prototype development; see the usage sketch below.
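
As a concrete illustration of that last point, the snippet below shows how a released checkpoint could be loaded with the Transformers pipeline API, assuming it is published on the Hugging Face Hub. The model id "org/bangla-meder" is a placeholder, not the actual published identifier.

```python
# A hedged usage sketch for the released models, assuming they are hosted on
# the Hugging Face Hub. "org/bangla-meder" is a placeholder model id.
from transformers import pipeline

ner = pipeline("token-classification",
               model="org/bangla-meder",        # placeholder model id
               aggregation_strategy="simple")   # merge subword pieces into entity spans
print(ner("রোগীর জ্বর এবং মাথাব্যথা আছে"))  # "The patient has fever and a headache."
```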

Limitations & Future Work

  • Dataset Size & Domain Coverage: Although high‑quality, the corpus is still modest (~5 k sentences) and focuses mainly on general medicine; specialty domains (e.g., oncology) remain under‑represented.
  • Annotation Consistency: Inter‑annotator agreement, while acceptable, indicates room for refining the entity schema and handling ambiguous cases.
  • Real‑World Deployment: The models were evaluated on clean, pre‑processed text; noisy inputs (typos, mixed scripts, code‑switching) typical of user‑generated content may degrade performance.
  • Future Directions:
    • Expand the dataset with crowd‑sourced annotations and domain‑specific sub‑corpora.
    • Incorporate character‑level or subword adapters to better handle orthographic variations.
    • Explore multilingual ensemble strategies that blend Bangla models with high‑resource English medical NER systems for zero‑shot transfer.

Bangla MedER demonstrates that a thoughtfully engineered ensemble of transformer models can dramatically improve medical entity extraction in a low‑resource language, offering a practical foundation for Bangla‑centric health‑tech applications.

Authors

  • Tanjim Taharat Aurpa
  • Farzana Akter
  • Md. Mehedi Hasan
  • Shakil Ahmed
  • Shifat Ara Rafiq
  • Fatema Khan

Paper Information

  • arXiv ID: 2512.17769v1
  • Categories: cs.CL, cs.AI
  • Published: December 19, 2025
