[Paper] BanglaMM-Disaster: A Multimodal Transformer-Based Deep Learning Framework for Multiclass Disaster Classification in Bangla
Source: arXiv - 2511.21364v1
Overview
A new study introduces BanglaMM-Disaster, a multimodal deep‑learning framework that jointly reads Bangla text and the accompanying image to classify disaster‑related social‑media posts into nine categories. By fusing language and vision models, the authors gain 3.84 percentage points in accuracy over the strongest single‑modality baseline, opening the door to faster, more reliable disaster monitoring in Bangladesh and other low‑resource language settings.
Key Contributions
- Bangla‑specific multimodal dataset – 5,037 social‑media posts, each with a Bangla caption and an image, manually labeled into nine disaster classes.
- End‑to‑end transformer‑CNN architecture – combines Bangla‑focused text encoders (BanglaBERT, mBERT, XLM‑R) with visual backbones (ResNet‑50, DenseNet‑169, MobileNet‑V2) via early fusion.
- State‑of‑the‑art performance – the best configuration reaches 83.76 % accuracy, beating the strongest text‑only model by 3.84 percentage points and the strongest image‑only model by 16.91 points.
- Comprehensive error analysis – demonstrates reduced misclassifications, especially for ambiguous posts where text or image alone is insufficient.
- Open‑source resources – the authors release the dataset and code, providing a baseline for future Bangla multimodal research.
Methodology
- Data collection & annotation – Posts were scraped from public Bangla social‑media channels, filtered for disaster relevance, and labeled by domain experts into categories such as Flood, Cyclone, Fire, etc.
- Text processing – Captions are tokenized and fed into pre‑trained transformer models (BanglaBERT, multilingual BERT, XLM‑R). The final hidden state (CLS token) serves as the textual embedding.
- Image processing – Images pass through a convolutional neural network (ResNet‑50, DenseNet‑169, or MobileNet‑V2) pretrained on ImageNet; the penultimate layer’s feature map is extracted as the visual embedding.
- Early fusion – Text and visual embeddings are concatenated, then passed through a small fully‑connected classifier (two dense layers + softmax). The whole pipeline is trained end‑to‑end with cross‑entropy loss.
- Training details – Standard data augmentations for images, AdamW optimizer, learning‑rate warm‑up, and 5‑fold cross‑validation to ensure robust estimates.
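The bullets above map fairly directly onto code. Here is a minimal PyTorch sketch of the early‑fusion architecture, assuming the Hugging Face checkpoint `csebuetnlp/banglabert` for the text branch, a 768‑dim [CLS] embedding, a 2048‑dim ResNet‑50 feature vector, and an illustrative 512‑unit hidden layer; none of these head dimensions are confirmed by the paper.

```python
# Minimal early-fusion sketch (not the authors' released code).
import torch
import torch.nn as nn
from torchvision import models
from transformers import AutoModel

class EarlyFusionClassifier(nn.Module):
    def __init__(self, text_model="csebuetnlp/banglabert", num_classes=9):
        super().__init__()
        # Text branch: pre-trained transformer; the final hidden state of
        # the [CLS] token is the 768-dim textual embedding.
        self.text_encoder = AutoModel.from_pretrained(text_model)
        # Image branch: ImageNet-pretrained ResNet-50 with its final
        # classification layer removed -> 2048-dim visual embedding.
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.image_encoder = nn.Sequential(*list(resnet.children())[:-1])
        # Early fusion: concatenate both embeddings, then a small
        # two-layer classifier (softmax is applied implicitly by the
        # cross-entropy loss during training).
        self.classifier = nn.Sequential(
            nn.Linear(768 + 2048, 512),  # hidden width is an assumption
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        text_out = self.text_encoder(input_ids=input_ids,
                                     attention_mask=attention_mask)
        cls_emb = text_out.last_hidden_state[:, 0]        # [CLS] embedding
        img_emb = self.image_encoder(pixel_values).flatten(1)
        fused = torch.cat([cls_emb, img_emb], dim=1)      # early fusion
        return self.classifier(fused)                     # class logits
```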
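And a matching sketch of the training setup. The learning rate, epoch count, and warm‑up fraction are illustrative assumptions, and `train_loader` stands in for a DataLoader over the annotated posts (one fold of the 5‑fold cross‑validation); the paper’s exact hyperparameters may differ.

```python
# Illustrative training loop: AdamW + linear warm-up + cross-entropy.
import torch.nn as nn
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

model = EarlyFusionClassifier()
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)  # assumed
criterion = nn.CrossEntropyLoss()

epochs = 5                                     # assumed epoch count
num_steps = epochs * len(train_loader)         # train_loader: assumed DataLoader
scheduler = get_linear_schedule_with_warmup(   # 10% warm-up is an assumption
    optimizer, num_warmup_steps=num_steps // 10, num_training_steps=num_steps)

model.train()
for _ in range(epochs):
    for batch in train_loader:  # dicts with input_ids, attention_mask,
        optimizer.zero_grad()   # pixel_values (augmented images), labels
        logits = model(batch["input_ids"], batch["attention_mask"],
                       batch["pixel_values"])
        loss = criterion(logits, batch["labels"])
        loss.backward()
        optimizer.step()
        scheduler.step()
```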
Results & Findings
| Model (Text + Image) | Accuracy | Gain vs. text‑only (pp) | Gain vs. image‑only (pp) |
|---|---|---|---|
| BanglaBERT + ResNet‑50 (early fusion) | 83.76 % | +3.84 | +16.91 |
| mBERT + DenseNet‑169 | 82.9 % | +2.9 | +15.6 |
| XLM‑R + MobileNet‑V2 | 81.7 % | +1.8 | +14.3 |
- Error reduction: Across all nine classes, the multimodal system cuts the top‑1 error rate by an average of 12 %, with the biggest improvements for Landslide and Storm Surge, where visual cues are decisive.
- Ablation study: Replacing early fusion with late fusion drops accuracy by roughly 2 percentage points, confirming that joint representation learning is beneficial.
- Resource efficiency: MobileNet‑V2‑based variants achieve comparable performance (>80 % accuracy) with ~30 % fewer FLOPs, making them viable for edge deployment.
Practical Implications
- Real‑time disaster dashboards: Emergency agencies can ingest live Bangla tweets or Facebook posts, automatically flag high‑risk content, and prioritize response crews.
- Low‑resource language support: The framework demonstrates that existing multilingual transformers (mBERT, XLM‑R) can be effectively combined with vision models without massive Bangla‑specific pre‑training, lowering the barrier for other under‑represented languages.
- Edge‑ready monitoring tools: The MobileNet‑V2 variant can run on smartphones or Raspberry‑Pi‑class devices, enabling community volunteers to run local classifiers offline when connectivity is disrupted.
- Cross‑modal data enrichment: Developers building chatbots, crisis‑mapping platforms, or news‑aggregation services can plug the model in as a “disaster‑confidence” scorer, improving content moderation and alerting pipelines (a sketch of such a scorer follows this list).
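As a concrete illustration of the last point, here is a hypothetical wrapper that turns the `EarlyFusionClassifier` sketch from the Methodology section into a per‑post confidence scorer; the 224×224 resize, ImageNet normalization, and 128‑token cap are preprocessing assumptions, not documented choices.

```python
# Hypothetical "disaster-confidence" scorer built on the earlier sketch.
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")  # assumed
image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def disaster_confidence(model, caption: str, image_path: str) -> torch.Tensor:
    """Return a nine-class probability distribution for one post."""
    model.eval()
    enc = tokenizer(caption, return_tensors="pt",
                    truncation=True, max_length=128)
    pixels = image_transform(Image.open(image_path).convert("RGB")).unsqueeze(0)
    logits = model(enc["input_ids"], enc["attention_mask"], pixels)
    return torch.softmax(logits, dim=1).squeeze(0)
```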
Limitations & Future Work
- Dataset size & diversity: At 5,037 posts the corpus is modest; expanding to millions of multilingual posts would test scalability and robustness.
- Class imbalance: Some disaster categories (e.g., Earthquake) have far fewer examples, which may still bias predictions; techniques like focal loss or synthetic oversampling could help (see the sketch after this list).
- Temporal dynamics: The current model processes each post independently; incorporating time‑series or geospatial context could improve early detection of evolving events.
- Explainability: While early fusion yields better accuracy, the system offers limited insight into whether the text or image drove a particular decision—future work could integrate attention visualizations or multimodal saliency maps.
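On the class‑imbalance point, here is a minimal focal‑loss sketch (Lin et al., 2017) that could replace cross‑entropy in the training loop above; the γ = 2 and uniform α defaults are illustrative, not values from the paper.

```python
# Focal loss: down-weights easy examples so rare classes contribute more.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t)."""
    def __init__(self, gamma: float = 2.0, alpha: float = 1.0):
        super().__init__()
        self.gamma = gamma   # focusing parameter (assumed default)
        self.alpha = alpha   # class weighting (scalar here for simplicity)

    def forward(self, logits: torch.Tensor, targets: torch.Tensor):
        ce = F.cross_entropy(logits, targets, reduction="none")  # -log(p_t)
        p_t = torch.exp(-ce)                                     # recover p_t
        return (self.alpha * (1.0 - p_t) ** self.gamma * ce).mean()
```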
BanglaMM-Disaster showcases how a relatively simple early‑fusion of state‑of‑the‑art language and vision models can dramatically improve disaster classification in a low‑resource language. For developers building next‑generation crisis‑response tools, the paper provides both a ready‑to‑use dataset and a clear architectural blueprint that can be adapted to other languages and domains.
Authors
- Ariful Islam
- Md Rifat Hossen
- Md. Mahmudul Arif
- Abdullah Al Noman
- Md Arifur Rahman
Paper Information
- arXiv ID: 2511.21364v1
- Categories: cs.LG, cs.CV
- Published: November 26, 2025