[Paper] BanglaMM-Disaster: A Multimodal Transformer-Based Deep Learning Framework for Multiclass Disaster Classification in Bangla

Published: November 26, 2025 at 08:11 AM EST
4 min read
Source: arXiv - 2511.21364v1

Overview

A new study introduces BanglaMM-Disaster, a multimodal deep‑learning framework that simultaneously reads Bangla text and analyzes accompanying images to classify disaster‑related social‑media posts into nine categories. By fusing language and vision models, the authors achieve a noticeable boost in accuracy over single‑modality baselines, opening the door to faster, more reliable disaster monitoring in Bangladesh and other low‑resource language settings.

Key Contributions

  • Bangla‑specific multimodal dataset – 5,037 social‑media posts, each with a Bangla caption and an image, manually labeled into nine disaster classes.
  • End‑to‑end transformer‑CNN architecture – combines transformer text encoders that cover Bangla (BanglaBERT, mBERT, XLM‑R) with ImageNet‑pretrained visual backbones (ResNet‑50, DenseNet‑169, MobileNet‑V2) via early fusion.
  • State‑of‑the‑art performance – best configuration reaches 83.76 % accuracy, beating the strongest text‑only model by 3.84 % and the image‑only model by 16.91 %.
  • Comprehensive error analysis – demonstrates reduced misclassifications, especially for ambiguous posts where text or image alone is insufficient.
  • Open‑source potential – the authors release the dataset and code, providing a baseline for future Bangla multimodal research.

Methodology

  1. Data collection & annotation – Posts were scraped from public Bangla social‑media channels, filtered for disaster relevance, and labeled by domain experts into categories such as Flood, Cyclone, and Fire.
  2. Text processing – Captions are tokenized and fed into pre‑trained transformer encoders (BanglaBERT, multilingual BERT, XLM‑R); the final‑layer hidden state of the [CLS] token serves as the textual embedding.
  3. Image processing – Images pass through a convolutional neural network (ResNet‑50, DenseNet‑169, or MobileNet‑V2) pretrained on ImageNet; the penultimate layer’s feature map is extracted as the visual embedding.
  4. Early fusion – Text and visual embeddings are concatenated, then passed through a small fully‑connected classifier (two dense layers + softmax). The whole pipeline is trained end‑to‑end with cross‑entropy loss (a minimal model sketch follows this list).
  5. Training details – Standard image data augmentations, the AdamW optimizer, learning‑rate warm‑up, and 5‑fold cross‑validation to ensure robust estimates (see the training sketch below).
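The authors' released code is not reproduced here, but the pipeline described above maps naturally onto a short PyTorch sketch. The snippet below assumes the Hugging Face `csebuetnlp/banglabert` checkpoint and torchvision's ResNet‑50 as the two encoders; the 512‑unit hidden layer, dropout rate, and other dimensions are illustrative assumptions rather than the paper's exact hyperparameters.

```python
# Illustrative early-fusion classifier (a sketch, not the authors' released code).
# Assumes PyTorch, Hugging Face transformers, and torchvision; model names and
# layer sizes are assumptions based on the paper summary.
import torch
import torch.nn as nn
from transformers import AutoModel
from torchvision import models

class EarlyFusionDisasterClassifier(nn.Module):
    def __init__(self, text_model_name="csebuetnlp/banglabert", num_classes=9):
        super().__init__()
        # Text branch: pre-trained transformer; the [CLS] hidden state is the text embedding.
        self.text_encoder = AutoModel.from_pretrained(text_model_name)
        text_dim = self.text_encoder.config.hidden_size            # 768 for BanglaBERT-base

        # Image branch: ImageNet-pretrained ResNet-50 with its classification head removed.
        resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        self.image_encoder = nn.Sequential(*list(resnet.children())[:-1])  # pooled 2048-d features
        image_dim = 2048

        # Early fusion: concatenate the two embeddings, then a small MLP classifier.
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes),            # softmax is applied inside CrossEntropyLoss
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        cls_embedding = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]                   # [CLS] token, shape (B, text_dim)
        visual_embedding = self.image_encoder(pixel_values).flatten(1)   # shape (B, image_dim)
        fused = torch.cat([cls_embedding, visual_embedding], dim=1)      # early fusion
        return self.classifier(fused)               # logits over the nine disaster classes
```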
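A matching training sketch shows how AdamW, linear warm‑up, and 5‑fold stratified cross‑validation fit together. The learning rate, batch size, warm‑up fraction, and the assumption that the dataset yields dictionaries with `input_ids`, `attention_mask`, `pixel_values`, and `label` keys are all illustrative, not the authors' exact configuration.

```python
# Illustrative 5-fold cross-validation loop (hyperparameters are assumptions).
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import StratifiedKFold
from torch.utils.data import DataLoader, Subset
from transformers import get_linear_schedule_with_warmup

def run_cross_validation(dataset, labels, num_epochs=5, batch_size=16, device="cuda"):
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    fold_accuracies = []

    for fold, (train_idx, val_idx) in enumerate(skf.split(np.zeros(len(labels)), labels)):
        train_loader = DataLoader(Subset(dataset, train_idx), batch_size=batch_size, shuffle=True)
        val_loader = DataLoader(Subset(dataset, val_idx), batch_size=batch_size)

        model = EarlyFusionDisasterClassifier().to(device)   # defined in the previous sketch
        criterion = nn.CrossEntropyLoss()
        optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
        total_steps = len(train_loader) * num_epochs
        scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps=int(0.1 * total_steps), num_training_steps=total_steps
        )

        for _ in range(num_epochs):
            model.train()
            for batch in train_loader:
                optimizer.zero_grad()
                logits = model(batch["input_ids"].to(device),
                               batch["attention_mask"].to(device),
                               batch["pixel_values"].to(device))
                loss = criterion(logits, batch["label"].to(device))
                loss.backward()
                optimizer.step()
                scheduler.step()

        # Fold-level validation accuracy
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for batch in val_loader:
                logits = model(batch["input_ids"].to(device),
                               batch["attention_mask"].to(device),
                               batch["pixel_values"].to(device))
                correct += (logits.argmax(dim=1).cpu() == batch["label"]).sum().item()
                total += len(batch["label"])
        fold_accuracies.append(correct / total)
        print(f"Fold {fold + 1}: accuracy {fold_accuracies[-1]:.4f}")

    return float(np.mean(fold_accuracies))
```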

Results & Findings

| Model (Text + Image)                   | Accuracy | Gain vs. Text‑only | Gain vs. Image‑only |
|----------------------------------------|----------|--------------------|---------------------|
| BanglaBERT + ResNet‑50 (early fusion)  | 83.76 %  | +3.84 %            | +16.91 %            |
| mBERT + DenseNet‑169                   | 82.9 %   | +2.9 %             | +15.6 %             |
| XLM‑R + MobileNet‑V2                   | 81.7 %   | +1.8 %             | +14.3 %             |
  • Error reduction: Across all nine classes, the multimodal system cuts the top‑1 error rate by an average of 12 %, with the biggest improvements for Landslide and Storm Surge where visual cues are decisive.
  • Ablation study: Replacing early fusion with late fusion drops accuracy by ~2 %, confirming that joint representation learning is beneficial (a late‑fusion sketch follows this list).
  • Resource efficiency: MobileNet‑V2‑based variants achieve comparable performance (>80 % accuracy) with ~30 % fewer FLOPs, making them viable for edge deployment.
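For context on the ablation, late fusion differs from early fusion in where the modalities meet: each branch is classified on its own and only the predicted probabilities are combined. Below is a minimal sketch of such a variant, reusing the encoders from the earlier snippet; the paper's exact late‑fusion setup may differ.

```python
# Illustrative late-fusion variant for the ablation comparison (a sketch only).
# Each modality gets its own classification head and the class probabilities
# are averaged, so no joint representation is learned.
import torch
import torch.nn as nn

class LateFusionDisasterClassifier(nn.Module):
    def __init__(self, text_encoder, image_encoder, text_dim=768, image_dim=2048, num_classes=9):
        super().__init__()
        self.text_encoder = text_encoder      # e.g., BanglaBERT, as in the early-fusion sketch
        self.image_encoder = image_encoder    # e.g., headless ResNet-50
        self.text_head = nn.Linear(text_dim, num_classes)
        self.image_head = nn.Linear(image_dim, num_classes)

    def forward(self, input_ids, attention_mask, pixel_values):
        cls_embedding = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        visual_embedding = self.image_encoder(pixel_values).flatten(1)
        # Fuse only at the decision level, after each modality is classified independently.
        text_probs = self.text_head(cls_embedding).softmax(dim=1)
        image_probs = self.image_head(visual_embedding).softmax(dim=1)
        return (text_probs + image_probs) / 2
```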

Practical Implications

  • Real‑time disaster dashboards: Emergency agencies can ingest live Bangla tweets or Facebook posts, automatically flag high‑risk content, and prioritize response crews.
  • Low‑resource language support: The framework demonstrates that existing multilingual transformers (mBERT, XLM‑R) can be effectively combined with vision models without massive Bangla‑specific pre‑training, lowering the barrier for other under‑represented languages.
  • Edge‑ready monitoring tools: The MobileNet‑V2 variant can run on smartphones or Raspberry‑Pi‑class devices, enabling community volunteers to run local classifiers offline when connectivity is disrupted.
  • Cross‑modal data enrichment: Developers building chatbots, crisis‑mapping platforms, or news‑aggregation services can plug the model in as a “disaster‑confidence” scorer, improving content moderation and alerting pipelines (see the scoring example after this list).
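As a usage illustration of the “disaster‑confidence” idea, the hypothetical helper below scores a single caption–image pair with a trained early‑fusion model. The tokenizer name, image preprocessing, `CLASS_NAMES` list, and the helper itself are assumptions for illustration, not part of the paper's released code.

```python
# Hypothetical inference helper: score one Bangla post (caption + image).
import torch
from PIL import Image
from transformers import AutoTokenizer
from torchvision import transforms

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def score_post(model, caption: str, image_path: str, class_names):
    """Return the predicted disaster class and its confidence for one post."""
    model.eval()
    encoded = tokenizer(caption, truncation=True, padding="max_length",
                        max_length=128, return_tensors="pt")
    pixels = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    logits = model(encoded["input_ids"], encoded["attention_mask"], pixels)
    probs = logits.softmax(dim=1).squeeze(0)
    top = int(probs.argmax())
    return class_names[top], float(probs[top])

# Example (bangla_caption is a post's text, CLASS_NAMES the nine labels):
# label, confidence = score_post(model, bangla_caption, "post.jpg", CLASS_NAMES)
```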

Limitations & Future Work

  • Dataset size & diversity: 5 k posts is modest; expanding to millions of multilingual posts would test scalability and robustness.
  • Class imbalance: Some disaster categories (e.g., Earthquake) have far fewer examples, which may still bias predictions. Techniques like focal loss or synthetic oversampling could help (see the sketch after this list).
  • Temporal dynamics: The current model processes each post independently; incorporating time‑series or geospatial context could improve early detection of evolving events.
  • Explainability: While early fusion yields better accuracy, the system offers limited insight into whether the text or image drove a particular decision—future work could integrate attention visualizations or multimodal saliency maps.
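One of the imbalance remedies mentioned above, focal loss (Lin et al., 2017), is easy to sketch as a drop‑in replacement for cross‑entropy. The `gamma` value and the optional per‑class weights below are illustrative assumptions, not settings from the paper.

```python
# Minimal multiclass focal loss sketch; gamma and alpha are assumptions.
from typing import Optional
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    def __init__(self, gamma: float = 2.0, alpha: Optional[torch.Tensor] = None):
        super().__init__()
        self.gamma = gamma        # down-weights easy, well-classified examples
        self.alpha = alpha        # optional per-class weights for rare categories

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        ce = F.cross_entropy(logits, targets, weight=self.alpha, reduction="none")
        pt = torch.exp(-ce)                      # probability assigned to the true class
        return ((1.0 - pt) ** self.gamma * ce).mean()

# Drop-in replacement for nn.CrossEntropyLoss in the training sketch above:
# criterion = FocalLoss(gamma=2.0)
```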

BanglaMM-Disaster showcases how a relatively simple early‑fusion of state‑of‑the‑art language and vision models can dramatically improve disaster classification in a low‑resource language. For developers building next‑generation crisis‑response tools, the paper provides both a ready‑to‑use dataset and a clear architectural blueprint that can be adapted to other languages and domains.

Authors

  • Ariful Islam
  • Md Rifat Hossen
  • Md. Mahmudul Arif
  • Abdullah Al Noman
  • Md Arifur Rahman

Paper Information

  • arXiv ID: 2511.21364v1
  • Categories: cs.LG, cs.CV
  • Published: November 26, 2025
