[Paper] CPJ: Explainable Agricultural Pest Diagnosis via Caption-Prompt-Judge with LLM-Judged Refinement
Source: arXiv - 2512.24947v1
Overview
The paper introduces CPJ (Caption‑Prompt‑Judge), a training‑free, few‑shot framework that turns large vision‑language models (VLMs) into explainable agricultural pest and disease diagnosticians. By generating structured image captions, refining them with a language‑model “judge,” and feeding the polished captions into a dual‑answer VQA pipeline, CPJ delivers both accurate pest identification and actionable management advice—without any costly supervised fine‑tuning.
Key Contributions
- Training‑free few‑shot pipeline – eliminates the need for large labeled datasets or expensive fine‑tuning of VLMs for agricultural diagnosis.
- Caption‑Prompt‑Judge loop – uses a VLM to produce multi‑angle captions, then an LLM (acting as a judge) iteratively refines those captions for factual consistency and completeness.
- Dual‑answer VQA design – generates two complementary answers based on the refined captions:
  - disease/pest classification
  - recommended mitigation steps
- Significant performance boost – on the CDDMBench benchmark, CPJ lifts disease classification accuracy by +22.7 pp and overall QA score by +19.5 pp compared with a baseline that skips captions.
- Open‑source release – code, data, and prompts are publicly available, encouraging reproducibility and community extensions.
Methodology
Step 1 – Image → Raw Captions
- A large vision‑language model (e.g., GPT‑5‑Mini) receives the crop image and a set of prompt templates (e.g., “Describe the visible symptoms”, “Identify the affected plant part”).
- It outputs several short captions covering different diagnostic angles (symptom description, context, severity).
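This multi-angle captioning step can be sketched as follows. The `query_vlm` helper is a hypothetical stand-in for the actual VLM API call (the paper uses GPT‑5‑Mini); it is stubbed here so the sketch runs, and the prompt templates are illustrative, not the paper's exact wording:

```python
# Hypothetical sketch of the multi-angle captioning step.
# query_vlm is a stand-in for a real VLM API call (e.g. GPT-5-Mini).

CAPTION_PROMPTS = [
    "Describe the visible symptoms on the plant.",
    "Identify the affected plant part.",
    "Assess the severity and extent of the damage.",
]

def query_vlm(image_path: str, prompt: str) -> str:
    """Stub: replace with a real vision-language model call."""
    return f"[caption for {image_path} | {prompt}]"

def generate_raw_captions(image_path: str) -> list[str]:
    """One short caption per diagnostic angle (symptoms, part, severity)."""
    return [query_vlm(image_path, p) for p in CAPTION_PROMPTS]
```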
Step 2 – LLM‑as‑Judge Refinement
- An LLM (e.g., GPT‑5‑Nano) is tasked with judging each caption: checking factual consistency, completeness, and relevance to pest diagnosis.
- The judge returns a revised caption and a confidence score. This loop runs a few times (typically 2–3 iterations) until the captions converge.
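The judge loop might look like the sketch below. Here `judge_caption` is a hypothetical stand-in for the LLM judge call (e.g. GPT‑5‑Nano), and the confidence threshold is an illustrative assumption rather than a value from the paper:

```python
# Hypothetical sketch of the LLM-as-judge refinement loop.
# judge_caption stands in for an LLM judge call (e.g. GPT-5-Nano)
# that returns a revised caption and a confidence score in [0, 1].

def judge_caption(caption: str) -> tuple[str, float]:
    """Stub: a real judge would check factual consistency,
    completeness, and relevance to pest diagnosis."""
    revised = caption.strip()
    confidence = 0.9 if revised else 0.0
    return revised, confidence

def refine(captions: list[str], max_iters: int = 3,
           threshold: float = 0.85) -> list[str]:
    """Re-judge each caption until the judge is confident
    or max_iters (the paper's 2-3 iterations) is exhausted."""
    refined = []
    for cap in captions:
        for _ in range(max_iters):
            cap, score = judge_caption(cap)
            if score >= threshold:
                break
        refined.append(cap)
    return refined
```

With a real judge, the loop converges once the revised caption stops changing or the confidence clears the threshold.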
Step 3 – Dual‑Answer VQA
- The refined captions are fed into a VQA model that is prompted to answer two questions:
  - Recognition – “What disease or pest is present?”
  - Management – “What immediate action should a farmer take?”
- Because the VQA model now has a concise, expert‑style textual context, it can produce more accurate and explainable answers.
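A rough sketch of the dual-answer step, with `answer_vqa` as a hypothetical placeholder for the VQA model call:

```python
# Hypothetical sketch of the dual-answer VQA step: the refined
# captions become the textual context for two fixed questions.

RECOGNITION_Q = "What disease or pest is present?"
MANAGEMENT_Q = "What immediate action should a farmer take?"

def answer_vqa(context: str, question: str) -> str:
    """Stub: replace with the real VQA model call."""
    return f"[answer to {question!r} given context]"

def dual_answer(refined_captions: list[str]) -> dict[str, str]:
    """Produce the recognition and management answers from one context."""
    context = "\n".join(refined_captions)
    return {
        "recognition": answer_vqa(context, RECOGNITION_Q),
        "management": answer_vqa(context, MANAGEMENT_Q),
    }
```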
Step 4 – Few‑Shot Prompting
- Only a handful of exemplar Q&A pairs are supplied to the VQA model, keeping the approach lightweight and adaptable to new crops or regions.
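Assembling the few-shot prompt can be sketched as below; the exemplar pairs are invented for illustration and are not drawn from the paper:

```python
# Hypothetical sketch of few-shot prompt assembly: a handful of
# exemplar context/answer pairs are prepended before the current
# image's refined-caption context. Exemplars are illustrative only.

EXEMPLARS = [
    ("Leaf shows brown concentric rings.",
     "Early blight; remove infected leaves and apply fungicide."),
    ("White powdery coating on upper leaf surface.",
     "Powdery mildew; improve airflow and apply sulfur spray."),
]

def build_prompt(context: str, question: str) -> str:
    """Concatenate exemplars, the current context, and the question."""
    shots = "\n".join(f"Context: {c}\nAnswer: {a}" for c, a in EXEMPLARS)
    return f"{shots}\nContext: {context}\nQuestion: {question}\nAnswer:"
```

Swapping in new exemplars is how the framework adapts to a new crop or region without retraining.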
Results & Findings
| Metric | No‑Caption Baseline | CPJ (GPT‑5‑Mini captions → GPT‑5‑Nano VQA) |
|---|---|---|
| Disease classification accuracy | 58.3 % | 81.0 % (+22.7 pp) |
| Overall VQA score (classification + management) | 62.1 % | 81.6 % (+19.5 pp) |
- Robustness to domain shift – when tested on images from unseen farms or different lighting conditions, CPJ’s caption‑driven reasoning degraded far less than the baseline.
- Explainability – the refined captions serve as human‑readable evidence, allowing agronomists to verify the model’s reasoning step‑by‑step.
- Efficiency – the entire pipeline runs inference‑only; on a single RTX 4090, processing a batch of 32 images takes ~0.8 seconds per image.
Practical Implications
- Field‑ready diagnostic apps – developers can embed CPJ into mobile or edge devices, offering farmers instant, explainable disease alerts without needing to ship large labeled datasets for each new crop.
- Decision‑support dashboards – the caption + answer pair can be displayed side‑by‑side, giving extension officers transparent reasoning to back up recommendations.
- Rapid adaptation – because CPJ relies on prompts rather than fine‑tuned weights, supporting a new pest or region is as simple as updating the prompt templates or supplying a handful of fresh few‑shot examples.
- Cost savings – eliminates the expensive data‑collection and annotation pipelines traditionally required for high‑accuracy agricultural AI.
- Regulatory compliance – explainable outputs help meet emerging AI transparency guidelines in agriculture and food safety.
Limitations & Future Work
- Caption quality ceiling – the approach inherits the strengths and blind spots of the underlying VLM; rare or visually subtle diseases may still be mis‑described.
- LLM resource demand – while training‑free, the iterative judge step adds latency and requires access to powerful LLM APIs, which may be cost‑prohibitive at massive scale.
- Benchmark scope – experiments are limited to the CDDMBench dataset; broader field trials across diverse climates and crop varieties are needed.
- Future directions – the authors suggest exploring lightweight, on‑device LLM judges, integrating multimodal sensor data (e.g., temperature, humidity), and extending the framework to pest‑forecasting (temporal predictions) rather than single‑image diagnosis.
Authors
- Wentao Zhang
- Tao Fang
- Lina Lu
- Lifei Wang
- Weihe Zhong
Paper Information
- arXiv ID: 2512.24947v1
- Categories: cs.CV, cs.CL
- Published: December 31, 2025