[Paper] A Lightweight and Explainable Vision-Language Framework for Crop Disease Visual Question Answering
Source: arXiv - 2601.05143v1
Overview
A new paper introduces a lightweight, explainable vision‑language model that answers natural‑language questions about crop diseases directly from leaf images. By pairing a Swin Transformer visual encoder with a compact sequence‑to‑sequence language decoder, the authors achieve high accuracy while keeping the model small enough for real‑world deployment on edge devices.
Key Contributions
- Compact architecture: Uses a Swin Transformer backbone and a modest seq2seq decoder, delivering performance comparable to or better than heavyweight V‑L baselines with ~10× fewer parameters.
- Two‑stage training pipeline: First pre‑trains the visual encoder on a large leaf‑image corpus, then fine‑tunes the full vision‑language system for cross‑modal alignment, improving both classification and language generation.
- Explainability toolkit: Integrates Grad‑CAM visualizations and token‑level attribution to surface why the model predicts a certain crop or disease and how it forms its answer.
- Comprehensive evaluation: Reports both classification metrics (accuracy, F1) and NLG metrics (BLEU, ROUGE, BERTScore) on a large, publicly‑available crop‑disease dataset.
- Robustness to diverse queries: Demonstrates stable performance across a variety of user‑driven question styles (e.g., “What disease is this leaf suffering from?” vs. “Is this plant healthy?”).
Methodology
- Vision Encoder – Swin Transformer
  - Processes high‑resolution leaf images using a hierarchical, shifted‑window attention mechanism.
  - Pre‑trained on a domain‑specific leaf‑image collection to capture fine‑grained disease patterns (spots, discoloration, texture).
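As a concrete illustration of this stage, here is a minimal sketch of visual‑token extraction with a timm Swin backbone. The specific variant (swin_tiny_patch4_window7_224), the 224×224 input size, and the token reshaping are assumptions; the paper's summary does not pin these details down.

```python
# Minimal sketch, assuming a timm Swin backbone stands in for the paper's encoder.
import torch
import timm

# num_classes=0 strips the classification head, leaving a feature extractor.
encoder = timm.create_model("swin_tiny_patch4_window7_224",
                            pretrained=True, num_classes=0)
encoder.eval()

image = torch.randn(1, 3, 224, 224)           # one RGB leaf image (batch of 1)
with torch.no_grad():
    feats = encoder.forward_features(image)   # hierarchical window-attention features

# Flatten the spatial grid into a sequence of visual tokens for the decoder.
# Depending on the timm version, feats is (B, H, W, C) or already (B, L, C).
if feats.dim() == 4:
    feats = feats.flatten(1, 2)               # -> (B, H*W, C)
print(feats.shape)                            # e.g. torch.Size([1, 49, 768])
```

The resulting (batch, tokens, channels) sequence is what the language decoder cross‑attends to.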
- Language Decoder – Seq2Seq Transformer
  - Takes the visual token embeddings from the encoder and generates natural‑language answers token by token.
  - Uses a modest number of layers (typically 4–6) to keep inference latency low.
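A hedged sketch of such a decoder, built from stock PyTorch modules, is shown below; the vocabulary size, model width, and head count are illustrative placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn

VOCAB, D_MODEL, N_LAYERS = 8000, 512, 4   # illustrative sizes, not reported values

class AnswerDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerDecoderLayer(d_model=D_MODEL, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=N_LAYERS)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens, visual_tokens):
        T = tokens.size(1)
        # Causal mask: each position may only attend to earlier answer tokens.
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(self.embed(tokens), memory=visual_tokens,
                         tgt_mask=mask)    # cross-attention into image tokens
        return self.lm_head(h)             # next-token logits

decoder = AnswerDecoder()
visual = torch.randn(1, 49, D_MODEL)       # encoder tokens, projected to D_MODEL
prefix = torch.randint(0, VOCAB, (1, 5))   # partially generated answer
logits = decoder(prefix, visual)           # shape: (1, 5, VOCAB)
```

At inference time the answer is produced autoregressively by repeatedly feeding the argmax (or sampled) token back into the prefix.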
- Two‑Stage Training
  - Stage 1 – Visual Pretraining: Freeze the language head and train the Swin encoder on a leaf‑image classification task (crop + disease labels).
  - Stage 2 – Cross‑Modal Fine‑Tuning: Unfreeze the whole network and train on paired (image, question, answer) triples, optimizing a combined loss: classification cross‑entropy plus language‑generation cross‑entropy.
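The combined Stage‑2 objective fits in a few lines; the sketch below assumes an equal 1:1 weighting (alpha) and dummy tensor shapes, since neither is reported. Stage 1's freezing simply amounts to setting requires_grad=False on the decoder's parameters.

```python
import torch
import torch.nn.functional as F

def combined_loss(cls_logits, labels, lm_logits, answer_ids, alpha=1.0):
    # Crop/disease supervision on the encoder's classification head.
    cls_loss = F.cross_entropy(cls_logits, labels)
    # Teacher-forced next-token prediction over the reference answer.
    lm_loss = F.cross_entropy(lm_logits.flatten(0, 1),    # (B*T, V)
                              answer_ids.flatten())       # (B*T,)
    return cls_loss + alpha * lm_loss                     # alpha: assumed 1.0

# Dummy tensors: batch 2, 38 classes, 12 answer tokens, 8k-word vocabulary.
loss = combined_loss(torch.randn(2, 38), torch.randint(0, 38, (2,)),
                     torch.randn(2, 12, 8000), torch.randint(0, 8000, (2, 12)))
print(loss.item())
```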
- Explainability
  - Grad‑CAM highlights the image regions that most influence the encoder’s output.
  - Token‑level attribution (via integrated gradients) shows which visual tokens contributed to each generated word, helping users trust the answer.
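For the Grad‑CAM half, a from‑scratch sketch over a Swin encoder follows; the hooked layer, the stand‑in classification head (cls_head), and the class count are all hypothetical, and the integrated‑gradients token attribution is omitted for brevity.

```python
import torch
import timm

encoder = timm.create_model("swin_tiny_patch4_window7_224",
                            pretrained=True, num_classes=0)
cls_head = torch.nn.Linear(768, 38)        # hypothetical crop+disease head
image = torch.randn(1, 3, 224, 224)        # stand-in for a leaf photo

store = {}

def hook(module, inp, out):
    store["feats"] = out                                # activations
    out.register_hook(lambda g: store.update(grads=g))  # their gradients

handle = encoder.layers[-1].register_forward_hook(hook)
feats = encoder.forward_features(image)    # (1, 7, 7, 768) NHWC in recent timm
logits = cls_head(feats.mean(dim=(1, 2)))  # pool spatial grid -> class scores
logits[0, logits[0].argmax()].backward()   # gradient of the top-class score
handle.remove()

f, g = store["feats"], store["grads"]      # both (1, H, W, C)
weights = g.mean(dim=(1, 2), keepdim=True) # per-channel importance
cam = torch.relu((weights * f).sum(-1))    # (1, H, W) heatmap over the leaf
```

Upsampling cam to the input resolution and overlaying it on the photo yields the lesion‑focused heatmaps the paper reports.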
Results & Findings
| Metric | Vision‑Language Baseline | Proposed Model |
|---|---|---|
| Crop classification accuracy | 92.1 % | 94.8 % |
| Disease classification accuracy | 88.3 % | 91.5 % |
| BLEU‑4 (answer generation) | 0.62 | 0.71 |
| ROUGE‑L | 0.68 | 0.75 |
| BERTScore | 0.84 | 0.89 |
| Parameters (M) | 250 | ≈25 |
| Inference time on CPU (ms) | 210 | ≈38 |
- The model outperforms large‑scale V‑L baselines (e.g., ViLT, LXMERT) on both visual and language metrics while using an order of magnitude fewer parameters.
- Explainability visualizations consistently focus on disease‑specific lesions (e.g., rust pustules, blight spots), confirming that the encoder learns semantically meaningful features.
- Qualitative tests show the system handling varied phrasings, multi‑step queries (“Is this leaf infected? If yes, what disease?”), and even ambiguous questions with graceful “I’m not sure” responses.
Practical Implications
- Edge deployment: The small footprint enables integration into smartphones, low‑cost drones, or IoT sensors used by farmers, providing instant disease diagnostics without cloud connectivity (one plausible packaging route is sketched after this list).
- Decision support: By returning natural‑language explanations (“The leaf shows circular brown spots typical of Septoria disease”), the system can be embedded in farm management software, reducing the need for specialist agronomists on site.
- Scalable data collection: The two‑stage training recipe can be adapted to new crops or emerging pathogens by simply adding a modest amount of labeled leaf images, making the pipeline future‑proof.
- Educational tools: Explainable V‑L outputs can serve as interactive teaching aids for agronomy students, illustrating visual cues linked to disease terminology.
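To make the edge‑deployment point concrete, here is a hedged sketch of one plausible packaging route using dynamic int8 quantization plus TorchScript. The paper does not name its deployment toolchain, so none of this is the authors' pipeline, and the Swin backbone below stands in for the full VQA model.

```python
import torch
import timm

# Stand-in for the full vision-language model (assumption).
model = timm.create_model("swin_tiny_patch4_window7_224", pretrained=True)
model.eval()

# Shrink linear-layer weights to int8; activations stay in float.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

# Trace to TorchScript so the model runs without a Python runtime
# (e.g. via PyTorch Mobile on a phone or a drone companion computer).
example = torch.randn(1, 3, 224, 224)
scripted = torch.jit.trace(quantized, example)
scripted.save("crop_vqa_edge.pt")
```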
Limitations & Future Work
- Dataset bias: The training set, while large, is sourced mainly from controlled environments; performance may degrade on images with extreme lighting or occlusions typical of field conditions.
- Question diversity: Current experiments focus on a limited set of templated questions; expanding to open‑ended or multi‑turn dialogues remains an open challenge.
- Cross‑crop generalization: The model is tuned per‑crop; a universal model that can handle any plant without retraining would further simplify deployment.
- Explainability depth: Grad‑CAM provides coarse heatmaps; future work could explore more granular attribution methods (e.g., attention roll‑out) to better align visual cues with specific disease terminology.
Bottom line: This lightweight, explainable V‑L framework demonstrates that high‑quality crop disease Q&A is achievable without massive models, opening the door for practical AI‑assisted agriculture on the ground.
Authors
- Md. Zahid Hossain
- Most. Sharmin Sultana Samu
- Md. Rakibul Islam
- Md. Siam Ansary
Paper Information
- arXiv ID: 2601.05143v1
- Categories: cs.CV, cs.CL
- Published: January 8, 2026