[Paper] A Lightweight and Explainable Vision-Language Framework for Crop Disease Visual Question Answering
Source: arXiv - 2601.05143v1
Overview
A new paper introduces a lightweight, explainable vision‑language model that answers natural‑language questions about crop diseases directly from leaf images. By pairing a Swin Transformer visual encoder with a compact sequence‑to‑sequence language decoder, the authors achieve high accuracy while keeping the model small enough for real‑world deployment on edge devices.
Key Contributions
- Compact architecture: Uses a Swin Transformer backbone and a modest seq2seq decoder, delivering performance comparable to or better than heavyweight V‑L baselines with ~10× fewer parameters.
- Two‑stage training pipeline: First pre‑trains the visual encoder on a large leaf‑image corpus, then fine‑tunes the full vision‑language system for cross‑modal alignment, improving both classification and language generation.
- Explainability toolkit: Integrates Grad‑CAM visualizations and token‑level attribution to surface why the model predicts a certain crop or disease and how it forms its answer.
- Comprehensive evaluation: Reports both classification metrics (accuracy, F1) and NLG metrics (BLEU, ROUGE, BERTScore) on a large, publicly‑available crop‑disease dataset.
- Robustness to diverse queries: Demonstrates stable performance across a variety of user‑driven question styles (e.g., “What disease is this leaf suffering from?” vs. “Is this plant healthy?”).
Methodology
- Vision Encoder – Swin Transformer
  - Processes high‑resolution leaf images using a hierarchical, shifted‑window attention mechanism.
  - Pre‑trained on a domain‑specific leaf‑image collection to capture fine‑grained disease patterns (spots, discoloration, texture).
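As a concrete illustration of this stage, here is a minimal sketch of visual‑token extraction with a timm Swin backbone. The specific variant (swin_tiny_patch4_window7_224), the 224×224 input size, and the token reshaping are assumptions; the paper's summary does not pin these details down.

```python
# Minimal sketch, assuming a timm Swin backbone stands in for the paper's encoder.
import torch
import timm

# num_classes=0 strips the classification head, leaving a feature extractor.
encoder = timm.create_model("swin_tiny_patch4_window7_224",
                            pretrained=True, num_classes=0)
encoder.eval()

image = torch.randn(1, 3, 224, 224)           # one RGB leaf image (batch of 1)
with torch.no_grad():
    feats = encoder.forward_features(image)   # hierarchical window-attention features

# Flatten the spatial grid into a sequence of visual tokens for the decoder.
# Depending on the timm version, feats is (B, H, W, C) or already (B, L, C).
if feats.dim() == 4:
    feats = feats.flatten(1, 2)               # -> (B, H*W, C)
print(feats.shape)                            # e.g. torch.Size([1, 49, 768])
```

The resulting (batch, tokens, channels) sequence is what the language decoder cross‑attends to.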
- Language Decoder – Seq2Seq Transformer
  - Takes the visual token embeddings from the encoder and generates natural‑language answers token by token.
  - Uses a modest number of layers (typically 4–6) to keep inference latency low.
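A hedged sketch of such a decoder, built from stock PyTorch modules, is shown below; the vocabulary size, model width, and head count are illustrative placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn

VOCAB, D_MODEL, N_LAYERS = 8000, 512, 4   # illustrative sizes, not reported values

class AnswerDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerDecoderLayer(d_model=D_MODEL, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=N_LAYERS)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens, visual_tokens):
        T = tokens.size(1)
        # Causal mask: each position may only attend to earlier answer tokens.
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(self.embed(tokens), memory=visual_tokens,
                         tgt_mask=mask)    # cross-attention into image tokens
        return self.lm_head(h)             # next-token logits

decoder = AnswerDecoder()
visual = torch.randn(1, 49, D_MODEL)       # encoder tokens, projected to D_MODEL
prefix = torch.randint(0, VOCAB, (1, 5))   # partially generated answer
logits = decoder(prefix, visual)           # shape: (1, 5, VOCAB)
```

At inference time the answer is produced autoregressively by repeatedly feeding the argmax (or sampled) token back into the prefix.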
- Two‑Stage Training
  - Stage 1 – Visual Pretraining: Freeze the language head and train the Swin encoder on a leaf‑image classification task (crop + disease labels).
  - Stage 2 – Cross‑Modal Fine‑Tuning: Unfreeze the whole network and train on paired (image, question, answer) triples, optimizing a combined loss: classification cross‑entropy plus language‑generation cross‑entropy.
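The combined Stage‑2 objective fits in a few lines; the sketch below assumes an equal 1:1 weighting (alpha) and dummy tensor shapes, since neither is reported. Stage 1's freezing simply amounts to setting requires_grad=False on the decoder's parameters.

```python
import torch
import torch.nn.functional as F

def combined_loss(cls_logits, labels, lm_logits, answer_ids, alpha=1.0):
    # Crop/disease supervision on the encoder's classification head.
    cls_loss = F.cross_entropy(cls_logits, labels)
    # Teacher-forced next-token prediction over the reference answer.
    lm_loss = F.cross_entropy(lm_logits.flatten(0, 1),    # (B*T, V)
                              answer_ids.flatten())       # (B*T,)
    return cls_loss + alpha * lm_loss                     # alpha: assumed 1.0

# Dummy tensors: batch 2, 38 classes, 12 answer tokens, 8k-word vocabulary.
loss = combined_loss(torch.randn(2, 38), torch.randint(0, 38, (2,)),
                     torch.randn(2, 12, 8000), torch.randint(0, 8000, (2, 12)))
print(loss.item())
```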
- Explainability
  - Grad‑CAM highlights the image regions that most influence the encoder’s output.
  - Token‑level attribution (via integrated gradients) shows which visual tokens contributed to each generated word, helping users trust the answer.
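For the Grad‑CAM half, a from‑scratch sketch over a Swin encoder follows; the hooked layer, the stand‑in classification head (cls_head), and the class count are all hypothetical, and the integrated‑gradients token attribution is omitted for brevity.

```python
import torch
import timm

encoder = timm.create_model("swin_tiny_patch4_window7_224",
                            pretrained=True, num_classes=0)
cls_head = torch.nn.Linear(768, 38)        # hypothetical crop+disease head
image = torch.randn(1, 3, 224, 224)        # stand-in for a leaf photo

store = {}

def hook(module, inp, out):
    store["feats"] = out                                # activations
    out.register_hook(lambda g: store.update(grads=g))  # their gradients

handle = encoder.layers[-1].register_forward_hook(hook)
feats = encoder.forward_features(image)    # (1, 7, 7, 768) NHWC in recent timm
logits = cls_head(feats.mean(dim=(1, 2)))  # pool spatial grid -> class scores
logits[0, logits[0].argmax()].backward()   # gradient of the top-class score
handle.remove()

f, g = store["feats"], store["grads"]      # both (1, H, W, C)
weights = g.mean(dim=(1, 2), keepdim=True) # per-channel importance
cam = torch.relu((weights * f).sum(-1))    # (1, H, W) heatmap over the leaf
```

Upsampling cam to the input resolution and overlaying it on the photo yields the lesion‑focused heatmaps the paper reports.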
Results & Findings
| Metric | Vision‑Language Baseline | Proposed Model |
|---|---|---|
| Crop classification accuracy | 92.1 % | 94.8 % |
| Disease classification accuracy | 88.3 % | 91.5 % |
| BLEU‑4 (answer generation) | 0.62 | 0.71 |
| ROUGE‑L | 0.68 | 0.75 |
| BERTScore | 0.84 | 0.89 |
| Parameters (M) | 250 | ≈25 |
| Inference time on CPU (ms) | 210 | ≈38 |
- The model outperforms large‑scale V‑L baselines (e.g., ViLT, LXMERT) on both visual and language metrics while using an order of magnitude fewer parameters.
- Explainability visualizations consistently focus on disease‑specific lesions (e.g., rust pustules, blight spots), confirming that the encoder learns semantically meaningful features.
- Qualitative tests show the system handling varied phrasings, multi‑step queries (“Is this leaf infected? If yes, what disease?”), and even ambiguous questions with graceful “I’m not sure” responses.
Practical Implications
- Edge deployment: The small footprint enables integration into smartphones, low‑cost drones, or IoT sensors used by farmers, providing instant disease diagnostics without cloud connectivity (one plausible packaging route is sketched after this list).
- Decision support: By returning natural‑language explanations (“The leaf shows circular brown spots typical of Septoria disease”), the system can be embedded in farm management software, reducing the need for specialist agronomists on site.
- Scalable data collection: The two‑stage training recipe can be adapted to new crops or emerging pathogens by simply adding a modest amount of labeled leaf images, making the pipeline future‑proof.
- Educational tools: Explainable V‑L outputs can serve as interactive teaching aids for agronomy students, illustrating visual cues linked to disease terminology.
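To make the edge‑deployment point concrete, here is a hedged sketch of one plausible packaging route using dynamic int8 quantization plus TorchScript. The paper does not name its deployment toolchain, so none of this is the authors' pipeline, and the Swin backbone below stands in for the full VQA model.

```python
import torch
import timm

# Stand-in for the full vision-language model (assumption).
model = timm.create_model("swin_tiny_patch4_window7_224", pretrained=True)
model.eval()

# Shrink linear-layer weights to int8; activations stay in float.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

# Trace to TorchScript so the model runs without a Python runtime
# (e.g. via PyTorch Mobile on a phone or a drone companion computer).
example = torch.randn(1, 3, 224, 224)
scripted = torch.jit.trace(quantized, example)
scripted.save("crop_vqa_edge.pt")
```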
Limitations & Future Work
- Dataset bias: The training set, while large, is sourced mainly from controlled environments; performance may degrade on images with extreme lighting or occlusions typical of field conditions.
- Question diversity: Current experiments focus on a limited set of templated questions; expanding to open‑ended or multi‑turn dialogues remains an open challenge.
- Cross‑crop generalization: The model is tuned per‑crop; a universal model that can handle any plant without retraining would further simplify deployment.
- Explainability depth: Grad‑CAM provides coarse heatmaps; future work could explore more granular attribution methods (e.g., attention roll‑out) to better align visual cues with specific disease terminology.
Bottom line: This lightweight, explainable V‑L framework demonstrates that high‑quality crop disease Q&A is achievable without massive models, opening the door for practical AI‑assisted agriculture on the ground.
Authors
- Md. Zahid Hossain
- Most. Sharmin Sultana Samu
- Md. Rakibul Islam
- Md. Siam Ansary
Paper Information
- arXiv ID: 2601.05143v1
- Categories: cs.CV, cs.CL
- Published: January 8, 2026