[Paper] CPJ: Explainable Agricultural Pest Diagnosis via Caption-Prompt-Judge with LLM-Judged Refinement

Published: December 31, 2025 at 11:21 AM EST
4 min read
Source: arXiv - 2512.24947v1

Overview

The paper introduces CPJ (Caption‑Prompt‑Judge), a training‑free, few‑shot framework that turns large vision‑language models (VLMs) into explainable agricultural pest and disease diagnosticians. By generating structured image captions, refining them with a language‑model “judge,” and feeding the polished captions into a dual‑answer VQA pipeline, CPJ delivers both accurate pest identification and actionable management advice—without any costly supervised fine‑tuning.

Key Contributions

  • Training‑free few‑shot pipeline – eliminates the need for large labeled datasets or expensive fine‑tuning of VLMs for agricultural diagnosis.
  • Caption‑Prompt‑Judge loop – uses a VLM to produce multi‑angle captions, then an LLM (acting as a judge) iteratively refines those captions for factual consistency and completeness.
  • Dual‑answer VQA design – based on the refined captions, generates two complementary answers:
    1. disease/pest classification
    2. recommended mitigation steps
  • Significant performance boost – on the CDDMBench benchmark, CPJ lifts disease classification accuracy by +22.7 pp and overall QA score by +19.5 pp compared with a baseline that skips captions.
  • Open‑source release – code, data, and prompts are publicly available, encouraging reproducibility and community extensions.

Methodology

  1. Image → Raw Captions

    • A large vision‑language model (e.g., GPT‑5‑Mini) receives the crop image and a set of prompt templates (e.g., “Describe the visible symptoms”, “Identify the affected plant part”).
    • It outputs several short captions covering different diagnostic angles (symptom description, context, severity).
  2. LLM‑as‑Judge Refinement

    • An LLM (e.g., GPT‑5‑Nano) is tasked with judging each caption: checking factual consistency, completeness, and relevance to pest diagnosis.
    • The judge returns a revised caption and a confidence score. This loop runs a few times (typically 2–3 iterations) until the captions converge.
  3. Dual‑Answer VQA

    • The refined captions are fed into a VQA model that is prompted to answer two questions:
      • Recognition – “What disease or pest is present?”
      • Management – “What immediate action should a farmer take?”
    • Because the VQA model now has a concise, expert‑style textual context, it can produce more accurate and explainable answers.
  4. Few‑Shot Prompting

    • Only a handful of exemplar Q&A pairs are supplied to the VQA model, keeping the approach lightweight and adaptable to new crops or regions.
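The four steps above can be sketched as a single control flow. This is a minimal illustration with hypothetical stand-in functions (`vlm_caption`, `llm_judge`, `dual_answer_vqa`) in place of the real GPT‑5‑Mini / GPT‑5‑Nano API calls, whose exact prompts and interfaces the paper defines:

```python
# Control-flow sketch of CPJ; model calls are replaced by placeholders.
CAPTION_PROMPTS = [
    "Describe the visible symptoms.",
    "Identify the affected plant part.",
    "Assess the severity and context.",
]

def vlm_caption(image, prompt):
    """Step 1 stand-in: a VLM would return a short diagnostic caption."""
    return f"caption({prompt})"

def llm_judge(caption):
    """Step 2 stand-in: the judge returns a revised caption plus a
    confidence score in [0, 1] for factual consistency/completeness."""
    return caption, 0.95

def refine(captions, max_iters=3, threshold=0.9):
    """Run the judge loop until all captions clear the confidence
    threshold or the iteration budget (the paper reports 2-3 passes)."""
    for _ in range(max_iters):
        judged = [llm_judge(c) for c in captions]
        captions = [text for text, _ in judged]
        if all(score >= threshold for _, score in judged):
            break
    return captions

def dual_answer_vqa(captions):
    """Step 3 stand-in: one refined-caption context, two questions."""
    context = " ".join(captions)
    return {
        "recognition": f"VQA('What disease or pest is present?', {context!r})",
        "management": f"VQA('What immediate action should a farmer take?', {context!r})",
    }

def diagnose(image):
    raw = [vlm_caption(image, p) for p in CAPTION_PROMPTS]  # 1. image -> raw captions
    refined = refine(raw)                                   # 2. LLM-as-judge refinement
    return dual_answer_vqa(refined)                         # 3. dual-answer VQA (few-shot prompts inside)
```

The key design point visible here is that the VQA step never sees the raw image, only the judge-approved textual context, which is what makes the final answers auditable.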

Results & Findings

| Metric | No‑Caption Baseline | CPJ (GPT‑5‑Mini captions → GPT‑5‑Nano VQA) |
| --- | --- | --- |
| Disease classification accuracy | 58.3 % | 81.0 % (+22.7 pp) |
| Overall VQA score (classification + management) | 62.1 % | 81.6 % (+19.5 pp) |

  • Robustness to domain shift – when tested on images from unseen farms or different lighting conditions, CPJ’s caption‑driven reasoning degraded far less than the baseline.
  • Explainability – the refined captions serve as human‑readable evidence, allowing agronomists to verify the model’s reasoning step‑by‑step.
  • Efficiency – the entire pipeline runs inference‑only; on a single RTX 4090, processing a batch of 32 images takes ~0.8 seconds per image.
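The percentage-point gains in the table follow directly from the reported numbers:

```python
baseline_cls, cpj_cls = 58.3, 81.0   # disease classification accuracy (%)
baseline_qa,  cpj_qa  = 62.1, 81.6   # overall VQA score (%)

cls_gain = round(cpj_cls - baseline_cls, 1)  # 22.7 pp
qa_gain  = round(cpj_qa  - baseline_qa,  1)  # 19.5 pp
```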

Practical Implications

  • Field‑ready diagnostic apps – developers can embed CPJ into mobile or edge devices, offering farmers instant, explainable disease alerts without needing to ship large labeled datasets for each new crop.
  • Decision‑support dashboards – the caption + answer pair can be displayed side‑by‑side, giving extension officers transparent reasoning to back up recommendations.
  • Rapid adaptation – because CPJ relies on prompts rather than fine‑tuned weights, adding a new pest or region is as simple as updating the prompt templates or supplying a handful of new few‑shot examples.
  • Cost savings – eliminates the expensive data‑collection and annotation pipelines traditionally required for high‑accuracy agricultural AI.
  • Regulatory compliance – explainable outputs help meet emerging AI transparency guidelines in agriculture and food safety.
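Because adaptation is prompt-level, supporting a new crop can be sketched as a data change rather than a training run. The registry structure, crop names, and exemplar answers below are illustrative, not taken from the paper:

```python
# Illustrative few-shot registry: extending CPJ to a new crop or region
# means adding exemplar Q&A pairs, not retraining model weights.
FEW_SHOT_EXEMPLARS = {
    "rice": [
        {"q": "What disease or pest is present?", "a": "Rice blast"},
        {"q": "What immediate action should a farmer take?",
         "a": "Remove infected leaves; apply a registered fungicide."},
    ],
}

def register_crop(crop, exemplars):
    """Add few-shot exemplars for a new crop; the pipeline is unchanged."""
    FEW_SHOT_EXEMPLARS.setdefault(crop, []).extend(exemplars)

register_crop("maize", [
    {"q": "What disease or pest is present?", "a": "Fall armyworm"},
])
```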

Limitations & Future Work

  • Caption quality ceiling – the approach inherits the strengths and blind spots of the underlying VLM; rare or visually subtle diseases may still be mis‑described.
  • LLM resource demand – while training‑free, the iterative judge step adds latency and requires access to powerful LLM APIs, which may be cost‑prohibitive at massive scale.
  • Benchmark scope – experiments are limited to the CDDMBench dataset; broader field trials across diverse climates and crop varieties are needed.
  • Future directions – the authors suggest exploring lightweight, on‑device LLM judges, integrating multimodal sensor data (e.g., temperature, humidity), and extending the framework to pest‑forecasting (temporal predictions) rather than single‑image diagnosis.

Authors

  • Wentao Zhang
  • Tao Fang
  • Lina Lu
  • Lifei Wang
  • Weihe Zhong

Paper Information

  • arXiv ID: 2512.24947v1
  • Categories: cs.CV, cs.CL
  • Published: December 31, 2025
  • PDF: Download PDF