[Paper] Bangla MedER: Multi-BERT Ensemble Approach for the Recognition of Bangla Medical Entity
Source: arXiv - 2512.17769v1
Overview
The paper introduces Bangla MedER, a new benchmark for recognizing medical entities in Bangla text, and proposes a Multi‑BERT Ensemble model that pushes accuracy close to 90 %. By tackling the scarcity of annotated Bangla medical data, the work opens the door for NLP‑driven healthcare tools in a language that has been largely ignored by the research community.
Key Contributions
- Bangla MedER dataset: a manually curated, high‑quality corpus of Bangla medical sentences with entity annotations (e.g., diseases, drugs, procedures).
- Comprehensive baseline study: evaluation of several transformer families (BERT, DistilBERT, ELECTRA, RoBERTa) on the new dataset.
- Multi‑BERT Ensemble architecture: combines predictions from multiple fine‑tuned BERT models using a voting/stacking scheme, achieving 89.58 % accuracy, an 11.80‑percentage‑point gain over a single‑layer BERT baseline.
- Extensive evaluation: reports precision, recall, F1‑score, and confusion matrices across entity types, demonstrating robustness.
- Open‑source release: code, trained models, and the dataset are made publicly available to encourage reproducibility and further research.
Methodology
Data Collection & Annotation
- Gathered Bangla medical texts from public health portals, research articles, and clinical notes.
- Professional annotators labeled entities such as Disease, Medication, Symptom, and Procedure following a predefined schema.
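To make the annotation format concrete, here is a minimal sketch of what a labeled sentence might look like, assuming the standard BIO tagging scheme commonly used for NER; the example sentence, its tokenization, and the exact tag names are illustrative, not drawn from the released corpus.

```python
# Illustrative BIO-style annotation (invented sentence; the paper's exact
# tag inventory and tokenization rules may differ).
# Rough gloss: "The patient was diagnosed with diabetes and metformin was given."
sentence = ["রোগীর", "ডায়াবেটিস", "ধরা", "পড়েছে", "এবং", "মেটফরমিন", "দেওয়া", "হয়েছে", "।"]
tags     = ["O", "B-Disease", "O", "O", "O", "B-Medication", "O", "O", "O"]

assert len(sentence) == len(tags)  # one label per token
for token, tag in zip(sentence, tags):
    print(f"{token}\t{tag}")
```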
Model Fine‑Tuning
- Each transformer (BERT‑base, DistilBERT, ELECTRA‑small, RoBERTa‑base) was fine‑tuned on the Bangla MedER training split using a token‑level classification head (softmax over entity tags).
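The token‑level classification head can be illustrated with a short Hugging Face Transformers sketch. The checkpoint name and BIO label set below are stand‑ins, since the summary does not specify the exact pretrained weights or tag inventory used in the paper.

```python
# Minimal token-classification sketch with Hugging Face Transformers.
# "bert-base-multilingual-cased" is a stand-in for whichever Bangla-capable
# BERT variant is fine-tuned; the BIO label set is likewise an assumption.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-Disease", "I-Disease", "B-Medication", "I-Medication",
          "B-Symptom", "I-Symptom", "B-Procedure", "I-Procedure"]
model_name = "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

enc = tokenizer("রোগীর ডায়াবেটিস ধরা পড়েছে", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits             # (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1).squeeze(0)  # argmax over entity tags per token

# Note: the freshly attached head is randomly initialized, so these
# predictions are meaningless until the model is fine-tuned on the dataset.
for tok, i in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist()),
                  pred_ids.tolist()):
    print(tok, labels[i])
```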
Ensemble Construction
- After individual fine‑tuning, the models’ logits for each token were aggregated.
- Two strategies were explored:
  - Majority voting (hard ensemble) – the most common tag among the models wins.
  - Stacked meta‑learner (soft ensemble) – a lightweight feed‑forward network learns to weight each model’s confidence scores.
- The stacked approach yielded the best performance and is referred to as the Multi‑BERT Ensemble.
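A minimal sketch of the two aggregation strategies, assuming each of K fine‑tuned models emits per‑token logits over a shared tag set; the tensor shapes and the meta‑learner width are illustrative choices, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

K, seq_len, num_tags = 4, 32, 9              # illustrative sizes
logits = torch.randn(K, seq_len, num_tags)   # per-model, per-token tag logits

# 1) Majority voting (hard ensemble): each model votes its argmax tag,
#    and the most common tag per token wins.
votes = logits.argmax(dim=-1)                # (K, seq_len)
hard_pred = votes.mode(dim=0).values         # (seq_len,)

# 2) Stacked meta-learner (soft ensemble): a lightweight feed-forward net
#    learns to weight the concatenated per-model confidence scores.
meta = nn.Sequential(
    nn.Linear(K * num_tags, 64),
    nn.ReLU(),
    nn.Linear(64, num_tags),
)
features = logits.softmax(dim=-1).permute(1, 0, 2).reshape(seq_len, K * num_tags)
soft_pred = meta(features).argmax(dim=-1)    # (seq_len,)

print(hard_pred.shape, soft_pred.shape)
```

In standard stacking setups, the meta‑learner is typically trained on base‑model predictions for data the base models did not see during fine‑tuning, so that it learns genuine complementarities rather than memorized training‑set behavior.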
Evaluation
- Standard NER metrics (precision, recall, F1) were computed per entity class and overall.
- A held‑out test set ensured that the ensemble’s gains were not due to overfitting.
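Entity‑level metrics of this kind are conventionally computed with a library such as seqeval; a small sketch with invented tag sequences:

```python
# Entity-level precision/recall/F1 with seqeval (pip install seqeval).
# The gold and predicted sequences below are made up for illustration.
from seqeval.metrics import classification_report, f1_score

y_true = [["O", "B-Disease", "I-Disease", "O", "B-Medication"]]
y_pred = [["O", "B-Disease", "O",         "O", "B-Medication"]]

print(f1_score(y_true, y_pred))               # overall entity-level F1
print(classification_report(y_true, y_pred))  # per-class precision/recall/F1
```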
Results & Findings
| Model | Accuracy | Macro‑F1 |
|---|---|---|
| BERT‑base (single layer) | 77.78 % | 0.73 |
| DistilBERT | 80.12 % | 0.75 |
| ELECTRA‑small | 81.45 % | 0.77 |
| RoBERTa‑base | 82.30 % | 0.78 |
| Multi‑BERT Ensemble | 89.58 % | 0.86 |
- The ensemble outperformed the strongest single model (RoBERTa‑base) by 7.28 percentage points in accuracy and 0.09 in macro‑F1.
- Gains were especially pronounced for low‑frequency entities (e.g., Procedure), where the ensemble mitigated individual model biases.
- Error analysis showed that most remaining mistakes stem from ambiguous phrasing and domain‑specific abbreviations not seen during training.
Practical Implications
- Clinical Decision Support: Automated extraction of diseases, medications, and procedures from Bangla electronic health records (EHRs) can feed downstream triage or alert systems.
- Health‑Chatbots & Virtual Assistants: Accurate entity recognition enables Bangla‑speaking chatbots to understand patient queries, retrieve relevant medical knowledge, and suggest next steps.
- Pharmacovigilance & Public Health Surveillance: Mining Bangla social media or news for drug‑related mentions becomes feasible, supporting early detection of adverse events.
- Cross‑Lingual Transfer: The ensemble framework can be adapted to other low‑resource medical languages by swapping in language‑specific pretrained transformers.
- Open‑source Toolkit: Developers can plug the released models into popular NLP libraries (Hugging Face Transformers) with minimal code changes, accelerating prototype development.
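As a rough sketch of that workflow, a released checkpoint could be loaded through the standard pipeline API; the model identifier below is a placeholder, since the summary does not name the actual Hub repository.

```python
# Hypothetical inference snippet; "your-org/bangla-meder-bert" is a
# placeholder identifier, not the authors' published checkpoint name.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="your-org/bangla-meder-bert",
    aggregation_strategy="simple",  # merge subword pieces into entity spans
)
# Rough gloss: "The patient was diagnosed with diabetes and metformin was given."
print(ner("রোগীর ডায়াবেটিস ধরা পড়েছে এবং মেটফরমিন দেওয়া হয়েছে।"))
```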
Limitations & Future Work
- Dataset Size & Domain Coverage: Although high‑quality, the corpus is still modest (~5 k sentences) and focuses mainly on general medicine; specialty domains (e.g., oncology) remain under‑represented.
- Annotation Consistency: Inter‑annotator agreement, while acceptable, indicates room for refining the entity schema and handling ambiguous cases.
- Real‑World Deployment: The models were evaluated on clean, pre‑processed text; noisy inputs (typos, mixed scripts, code‑switching) typical of user‑generated content may degrade performance.
- Future Directions:
  - Expand the dataset with crowd‑sourced annotations and domain‑specific sub‑corpora.
  - Incorporate character‑level or subword adapters to better handle orthographic variations.
  - Explore multilingual ensemble strategies that blend Bangla models with high‑resource English medical NER systems for zero‑shot transfer.
Bangla MedER demonstrates that a thoughtfully engineered ensemble of transformer models can dramatically improve medical entity extraction in a low‑resource language, offering a practical foundation for Bangla‑centric health‑tech applications.
Authors
- Tanjim Taharat Aurpa
- Farzana Akter
- Md. Mehedi Hasan
- Shakil Ahmed
- Shifat Ara Rafiq
- Fatema Khan
Paper Information
- arXiv ID: 2512.17769v1
- Categories: cs.CL, cs.AI
- Published: December 19, 2025