[Paper] BanglaMM-Disaster: A Multimodal Transformer-Based Deep Learning Framework for Multiclass Disaster Classification in Bangla
Source: arXiv - 2511.21364v1
Overview
A new study introduces BanglaMM-Disaster, a multimodal deep‑learning framework that jointly reads Bangla text and the accompanying image to classify disaster‑related social‑media posts into nine categories. By fusing language and vision models, the authors gain 3.84 percentage points in accuracy over the strongest single‑modality baseline, opening the door to faster, more reliable disaster monitoring in Bangladesh and other low‑resource language settings.
Key Contributions
- Bangla‑specific multimodal dataset – 5,037 social‑media posts, each with a Bangla caption and an image, manually labeled into nine disaster classes.
- End‑to‑end transformer‑CNN architecture – combines Bangla‑focused text encoders (BanglaBERT, mBERT, XLM‑R) with visual backbones (ResNet‑50, DenseNet‑169, MobileNet‑V2) via early fusion.
- State‑of‑the‑art performance – the best configuration reaches 83.76 % accuracy, beating the strongest text‑only model by 3.84 percentage points and the strongest image‑only model by 16.91 points.
- Comprehensive error analysis – demonstrates reduced misclassifications, especially for ambiguous posts where text or image alone is insufficient.
- Open‑source resources – the authors release the dataset and code, providing a baseline for future Bangla multimodal research.
Methodology
- Data collection & annotation – Posts were scraped from public Bangla social‑media channels, filtered for disaster relevance, and labeled by domain experts into categories such as Flood, Cyclone, Fire, etc.
- Text processing – Captions are tokenized and fed into pre‑trained transformer models (BanglaBERT, multilingual BERT, XLM‑R). The final hidden state (CLS token) serves as the textual embedding.
- Image processing – Images pass through a convolutional neural network (ResNet‑50, DenseNet‑169, or MobileNet‑V2) pretrained on ImageNet; the penultimate layer’s feature map is extracted as the visual embedding.
- Early fusion – Text and visual embeddings are concatenated, then passed through a small fully‑connected classifier (two dense layers + softmax). The whole pipeline is trained end‑to‑end with cross‑entropy loss.
- Training details – Standard data augmentations for images, AdamW optimizer, learning‑rate warm‑up, and 5‑fold cross‑validation to ensure robust estimates.
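The bullets above map fairly directly onto code. Here is a minimal PyTorch sketch of the early‑fusion architecture, assuming the Hugging Face checkpoint `csebuetnlp/banglabert` for the text branch, a 768‑dim [CLS] embedding, a 2048‑dim ResNet‑50 feature vector, and an illustrative 512‑unit hidden layer; none of these head dimensions are confirmed by the paper.

```python
# Minimal early-fusion sketch (not the authors' released code).
import torch
import torch.nn as nn
from torchvision import models
from transformers import AutoModel

class EarlyFusionClassifier(nn.Module):
    def __init__(self, text_model="csebuetnlp/banglabert", num_classes=9):
        super().__init__()
        # Text branch: pre-trained transformer; the final hidden state of
        # the [CLS] token is the 768-dim textual embedding.
        self.text_encoder = AutoModel.from_pretrained(text_model)
        # Image branch: ImageNet-pretrained ResNet-50 with its final
        # classification layer removed -> 2048-dim visual embedding.
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.image_encoder = nn.Sequential(*list(resnet.children())[:-1])
        # Early fusion: concatenate both embeddings, then a small
        # two-layer classifier (softmax is applied implicitly by the
        # cross-entropy loss during training).
        self.classifier = nn.Sequential(
            nn.Linear(768 + 2048, 512),  # hidden width is an assumption
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        text_out = self.text_encoder(input_ids=input_ids,
                                     attention_mask=attention_mask)
        cls_emb = text_out.last_hidden_state[:, 0]        # [CLS] embedding
        img_emb = self.image_encoder(pixel_values).flatten(1)
        fused = torch.cat([cls_emb, img_emb], dim=1)      # early fusion
        return self.classifier(fused)                     # class logits
```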
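And a matching sketch of the training setup. The learning rate, epoch count, and warm‑up fraction are illustrative assumptions, and `train_loader` stands in for a DataLoader over the annotated posts (one fold of the 5‑fold cross‑validation); the paper’s exact hyperparameters may differ.

```python
# Illustrative training loop: AdamW + linear warm-up + cross-entropy.
import torch.nn as nn
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

model = EarlyFusionClassifier()
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)  # assumed
criterion = nn.CrossEntropyLoss()

epochs = 5                                     # assumed epoch count
num_steps = epochs * len(train_loader)         # train_loader: assumed DataLoader
scheduler = get_linear_schedule_with_warmup(   # 10% warm-up is an assumption
    optimizer, num_warmup_steps=num_steps // 10, num_training_steps=num_steps)

model.train()
for _ in range(epochs):
    for batch in train_loader:  # dicts with input_ids, attention_mask,
        optimizer.zero_grad()   # pixel_values (augmented images), labels
        logits = model(batch["input_ids"], batch["attention_mask"],
                       batch["pixel_values"])
        loss = criterion(logits, batch["labels"])
        loss.backward()
        optimizer.step()
        scheduler.step()
```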
Results & Findings
| Model (Text + Image) | Accuracy | Gain vs. text‑only (pp) | Gain vs. image‑only (pp) |
|---|---|---|---|
| BanglaBERT + ResNet‑50 (early fusion) | 83.76 % | +3.84 | +16.91 |
| mBERT + DenseNet‑169 | 82.9 % | +2.9 | +15.6 |
| XLM‑R + MobileNet‑V2 | 81.7 % | +1.8 | +14.3 |
- Error reduction: Across all nine classes, the multimodal system cuts the top‑1 error rate by an average of 12 %, with the biggest improvements for Landslide and Storm Surge, where visual cues are decisive.
- Ablation study: Replacing early fusion with late fusion drops accuracy by roughly 2 percentage points, confirming that joint representation learning is beneficial.
- Resource efficiency: MobileNet‑V2‑based variants achieve comparable performance (>80 % accuracy) with ~30 % fewer FLOPs, making them viable for edge deployment.
Practical Implications
- Real‑time disaster dashboards: Emergency agencies can ingest live Bangla tweets or Facebook posts, automatically flag high‑risk content, and prioritize response crews.
- Low‑resource language support: The framework demonstrates that existing multilingual transformers (mBERT, XLM‑R) can be effectively combined with vision models without massive Bangla‑specific pre‑training, lowering the barrier for other under‑represented languages.
- Edge‑ready monitoring tools: The MobileNet‑V2 variant can run on smartphones or Raspberry‑Pi‑class devices, enabling community volunteers to run local classifiers offline when connectivity is disrupted.
- Cross‑modal data enrichment: Developers building chatbots, crisis‑mapping platforms, or news‑aggregation services can plug the model in as a “disaster‑confidence” scorer, improving content moderation and alerting pipelines (a sketch of such a scorer follows this list).
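As a concrete illustration of the last point, here is a hypothetical wrapper that turns the `EarlyFusionClassifier` sketch from the Methodology section into a per‑post confidence scorer; the 224×224 resize, ImageNet normalization, and 128‑token cap are preprocessing assumptions, not documented choices.

```python
# Hypothetical "disaster-confidence" scorer built on the earlier sketch.
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")  # assumed
image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def disaster_confidence(model, caption: str, image_path: str) -> torch.Tensor:
    """Return a nine-class probability distribution for one post."""
    model.eval()
    enc = tokenizer(caption, return_tensors="pt",
                    truncation=True, max_length=128)
    pixels = image_transform(Image.open(image_path).convert("RGB")).unsqueeze(0)
    logits = model(enc["input_ids"], enc["attention_mask"], pixels)
    return torch.softmax(logits, dim=1).squeeze(0)
```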
Limitations & Future Work
- Dataset size & diversity: At 5,037 posts the corpus is modest; expanding to millions of multilingual posts would test scalability and robustness.
- Class imbalance: Some disaster categories (e.g., Earthquake) have far fewer examples, which may still bias predictions; techniques like focal loss or synthetic oversampling could help (see the sketch after this list).
- Temporal dynamics: The current model processes each post independently; incorporating time‑series or geospatial context could improve early detection of evolving events.
- Explainability: While early fusion yields better accuracy, the system offers limited insight into whether the text or image drove a particular decision—future work could integrate attention visualizations or multimodal saliency maps.
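On the class‑imbalance point, here is a minimal focal‑loss sketch (Lin et al., 2017) that could replace cross‑entropy in the training loop above; the γ = 2 and uniform α defaults are illustrative, not values from the paper.

```python
# Focal loss: down-weights easy examples so rare classes contribute more.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t)."""
    def __init__(self, gamma: float = 2.0, alpha: float = 1.0):
        super().__init__()
        self.gamma = gamma   # focusing parameter (assumed default)
        self.alpha = alpha   # class weighting (scalar here for simplicity)

    def forward(self, logits: torch.Tensor, targets: torch.Tensor):
        ce = F.cross_entropy(logits, targets, reduction="none")  # -log(p_t)
        p_t = torch.exp(-ce)                                     # recover p_t
        return (self.alpha * (1.0 - p_t) ** self.gamma * ce).mean()
```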
BanglaMM-Disaster showcases how a relatively simple early‑fusion of state‑of‑the‑art language and vision models can dramatically improve disaster classification in a low‑resource language. For developers building next‑generation crisis‑response tools, the paper provides both a ready‑to‑use dataset and a clear architectural blueprint that can be adapted to other languages and domains.
Authors
- Ariful Islam
- Md Rifat Hossen
- Md. Mahmudul Arif
- Abdullah Al Noman
- Md Arifur Rahman
Paper Information
- arXiv ID: 2511.21364v1
- Categories: cs.LG, cs.CV
- Published: November 26, 2025