[Paper] A Lightweight and Explainable Vision-Language Framework for Crop Disease Visual Question Answering

Published: January 8, 2026 at 12:31 PM EST
4 min read
Source: arXiv - 2601.05143v1

Overview

A new paper introduces a lightweight, explainable vision‑language model that can answer natural‑language questions about crop diseases directly from leaf images. By marrying a Swin‑Transformer visual encoder with a compact sequence‑to‑sequence language decoder, the authors achieve high accuracy while keeping the model size small enough for real‑world deployment on edge devices.
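
The post does not include code, but the described pairing can be sketched in a few lines of PyTorch. Everything below is an illustrative assumption rather than the authors' implementation: the timm Swin‑T backbone, the 4‑layer torch.nn.TransformerDecoder, the vocabulary size, and the choice to feed the question by concatenating its embeddings with the visual tokens as decoder memory.

```python
# Minimal sketch of the encoder-decoder wiring (illustrative, not the authors' code).
import torch
import torch.nn as nn
import timm  # assumed: Swin backbone loaded via timm


class CropVQAModel(nn.Module):
    def __init__(self, vocab_size=8000, d_model=256, n_layers=4):
        super().__init__()
        # Hierarchical Swin-T encoder; num_classes=0 drops the classification head.
        # pretrained=False here: in the paper, domain pretraining happens in Stage 1.
        self.encoder = timm.create_model(
            "swin_tiny_patch4_window7_224", pretrained=False, num_classes=0)
        self.proj = nn.Linear(self.encoder.num_features, d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, question_tokens, answer_tokens):
        feats = self.encoder.forward_features(images)
        if feats.dim() == 4:                            # (B, H, W, C) -> (B, L, C)
            feats = feats.flatten(1, 2)
        memory = torch.cat(                             # visual tokens + question tokens
            [self.proj(feats), self.embed(question_tokens)], dim=1)
        tgt = self.embed(answer_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(tgt.device)
        out = self.decoder(tgt, memory, tgt_mask=mask)  # answer generated token by token
        return self.lm_head(out)                        # (B, T, vocab) logits
```

At inference time the decoder would be run autoregressively, feeding back the highest‑probability token at each step until an end‑of‑answer token is produced.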

Key Contributions

  • Compact architecture: Uses a Swin Transformer backbone and a modest seq2seq decoder, delivering performance comparable to or better than heavyweight V‑L baselines with roughly 10× fewer parameters.
  • Two‑stage training pipeline: First pre‑trains the visual encoder on a large leaf‑image corpus, then fine‑tunes the full vision‑language system for cross‑modal alignment, improving both classification and language generation.
  • Explainability toolkit: Integrates Grad‑CAM visualizations and token‑level attribution to surface why the model predicts a certain crop or disease and how it forms its answer.
  • Comprehensive evaluation: Reports both classification metrics (accuracy, F1) and NLG metrics (BLEU, ROUGE, BERTScore) on a large, publicly‑available crop‑disease dataset.
  • Robustness to diverse queries: Demonstrates stable performance across a variety of user‑driven question styles (e.g., “What disease is this leaf suffering from?” vs. “Is this plant healthy?”).

Methodology

  1. Vision Encoder – Swin Transformer

    • Processes high‑resolution leaf images using a hierarchical, shifted‑window attention mechanism.
    • Pre‑trained on a domain‑specific leaf image collection to capture fine‑grained disease patterns (spots, discoloration, texture).
  2. Language Decoder – Seq2Seq Transformer

    • Takes the visual token embeddings from the encoder and generates natural‑language answers token‑by‑token.
    • Uses a modest number of layers (typically 4–6) to keep inference latency low.
  3. Two‑Stage Training

    • Stage 1 – Visual Pretraining: Freeze the language head, train the Swin encoder on a leaf‑image classification task (crop + disease labels).
    • Stage 2 – Cross‑Modal Fine‑Tuning: Unfreeze the whole network and train on paired (image, question, answer) triples, optimizing a combined loss: classification cross‑entropy + language‑generation cross‑entropy (a condensed training sketch follows this list).
  4. Explainability

    • Grad‑CAM highlights image regions that most influence the encoder’s output.
    • Token‑level attribution (via integrated gradients) shows which visual tokens contributed to each generated word, helping users trust the answer.
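
Below is a condensed sketch of the two training stages, building on the CropVQAModel sketch from the Overview. The optimizers, learning rates, loss weighting (a plain sum here), and the cls_head, leaf_loader, and vqa_loader names are assumptions for illustration, not details taken from the paper; dummy batches stand in for the real datasets.

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()
model = CropVQAModel()

num_classes = 38                                       # hypothetical crop+disease label count
leaf_loader = [(torch.randn(2, 3, 224, 224),
                torch.randint(0, num_classes, (2,)))]  # dummy stand-in for the leaf dataset
vqa_loader = [(torch.randn(2, 3, 224, 224),
               torch.randint(0, 8000, (2, 16)),        # tokenized question
               torch.randint(0, 8000, (2, 8)),         # answer input (shifted right)
               torch.randint(0, 8000, (2, 8)),         # answer targets
               torch.randint(0, num_classes, (2,)))]   # crop+disease label

# --- Stage 1: visual pretraining on crop + disease labels (decoder untouched) ---
# cls_head is a hypothetical linear probe over mean-pooled Swin features.
cls_head = nn.Linear(model.encoder.num_features, num_classes)
opt1 = torch.optim.AdamW(
    list(model.encoder.parameters()) + list(cls_head.parameters()), lr=1e-4)

for images, labels in leaf_loader:
    pooled = model.encoder.forward_features(images).flatten(1, -2).mean(dim=1)
    loss = ce(cls_head(pooled), labels)
    opt1.zero_grad(); loss.backward(); opt1.step()

# --- Stage 2: cross-modal fine-tuning on (image, question, answer) triples ---
opt2 = torch.optim.AdamW(
    list(model.parameters()) + list(cls_head.parameters()), lr=5e-5)  # whole network unfrozen

for images, question, answer_in, answer_out, labels in vqa_loader:
    logits = model(images, question, answer_in)        # (B, T, vocab)
    gen_loss = ce(logits.flatten(0, 1), answer_out.flatten())
    pooled = model.encoder.forward_features(images).flatten(1, -2).mean(dim=1)
    cls_loss = ce(cls_head(pooled), labels)            # second forward pass kept for clarity
    (cls_loss + gen_loss).backward()                   # combined objective
    opt2.step(); opt2.zero_grad()
```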

Results & Findings

| Metric | Vision‑Language Baseline | Proposed Model |
| --- | --- | --- |
| Crop classification accuracy | 92.1 % | 94.8 % |
| Disease classification accuracy | 88.3 % | 91.5 % |
| BLEU‑4 (answer generation) | 0.62 | 0.71 |
| ROUGE‑L | 0.68 | 0.75 |
| BERTScore | 0.84 | 0.89 |
| Parameters (M) | 250 | ≈25 |
| CPU inference time (ms) | 210 | ≈38 |
  • The model outperforms large‑scale V‑L baselines (e.g., ViLT, LXMERT) on both visual and language metrics while using an order of magnitude fewer parameters.
  • Explainability visualizations consistently focus on disease‑specific lesions (e.g., rust pustules, blight spots), confirming that the encoder learns semantically meaningful features (a minimal Grad‑CAM sketch follows this list).
  • Qualitative tests show the system handling varied phrasings, multi‑step queries (“Is this leaf infected? If yes, what disease?”), and even ambiguous questions with graceful “I’m not sure” responses.
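
For readers who want to reproduce that kind of visualization, here is a minimal, generic Grad‑CAM sketch over the Stage‑1 classification path (the Swin encoder plus the hypothetical cls_head from the training sketch above). The paper's own explainability toolkit is not published with this summary; this hook‑based version only illustrates the idea.

```python
import torch
import torch.nn.functional as F


def grad_cam(encoder, cls_head, target_layer, image, class_idx):
    """Coarse heatmap showing which patches drove the predicted class."""
    store = {}
    h1 = target_layer.register_forward_hook(
        lambda mod, inp, out: store.update(act=out))
    h2 = target_layer.register_full_backward_hook(
        lambda mod, gin, gout: store.update(grad=gout[0]))

    pooled = encoder.forward_features(image.unsqueeze(0)).flatten(1, -2).mean(dim=1)
    logits = cls_head(pooled)                          # (1, num_classes)
    logits[0, class_idx].backward()                    # gradient of the target class
    h1.remove(); h2.remove()

    act, grad = store["act"], store["grad"]            # token maps from target_layer
    if act.dim() == 4:                                 # (1, H, W, C) -> (1, L, C)
        act, grad = act.flatten(1, 2), grad.flatten(1, 2)
    weights = grad.mean(dim=1, keepdim=True)           # per-channel importance
    cam = F.relu((weights * act).sum(dim=-1))          # (1, L) patch relevance
    side = int(cam.size(1) ** 0.5)
    cam = cam.reshape(side, side).detach()
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)


# Example (assumed layer choice): attribute the prediction w.r.t. the final Swin stage.
# heatmap = grad_cam(model.encoder, cls_head, model.encoder.layers[-1], leaf_image, pred_idx)
```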

Practical Implications

  • Edge deployment: The small footprint enables integration into smartphones, low‑cost drones, or IoT sensors used by farmers, providing instant disease diagnostics without cloud connectivity (a rough quantization sketch follows this list).
  • Decision support: By returning natural‑language explanations (“The leaf shows circular brown spots typical of Septoria disease”), the system can be embedded in farm management software, reducing the need for specialist agronomists on site.
  • Scalable data collection: The two‑stage training recipe can be adapted to new crops or emerging pathogens by simply adding a modest amount of labeled leaf images, making the pipeline future‑proof.
  • Educational tools: Explainable V‑L outputs can serve as interactive teaching aids for agronomy students, illustrating visual cues linked to disease terminology.
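
As a rough illustration of the edge‑deployment point: the ≈38 ms CPU figure above is for the full‑precision model, and dynamic INT8 quantization of the linear layers is one standard PyTorch route to shrink such a model further for CPU‑only devices. The paper does not report this step; the snippet assumes the CropVQAModel sketch from earlier and uses dummy inputs.

```python
import time
import torch

model.eval()
# Replace nn.Linear weights with INT8 for CPU-only inference.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

image = torch.randn(1, 3, 224, 224)                 # dummy leaf image
question = torch.randint(0, 8000, (1, 16))          # dummy tokenized question
answer_prefix = torch.randint(0, 8000, (1, 8))      # dummy decoder input

with torch.no_grad():
    start = time.perf_counter()
    quantized(image, question, answer_prefix)
    print(f"CPU latency: {(time.perf_counter() - start) * 1000:.1f} ms")
```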

Limitations & Future Work

  • Dataset bias: The training set, while large, is sourced mainly from controlled environments; performance may degrade on images with extreme lighting or occlusions typical of field conditions.
  • Question diversity: Current experiments focus on a limited set of templated questions; expanding to open‑ended or multi‑turn dialogues remains an open challenge.
  • Cross‑crop generalization: The model is tuned per‑crop; a universal model that can handle any plant without retraining would further simplify deployment.
  • Explainability depth: Grad‑CAM provides coarse heatmaps; future work could explore more granular attribution methods (e.g., attention roll‑out) to better align visual cues with specific disease terminology.

Bottom line: This lightweight, explainable V‑L framework demonstrates that high‑quality crop disease Q&A is achievable without massive models, opening the door for practical AI‑assisted agriculture on the ground.

Authors

  • Md. Zahid Hossain
  • Most. Sharmin Sultana Samu
  • Md. Rakibul Islam
  • Md. Siam Ansary

Paper Information

  • arXiv ID: 2601.05143v1
  • Categories: cs.CV, cs.CL
  • Published: January 8, 2026