[Paper] Beyond Blind Spots: Analytic Hints for Mitigating LLM-Based Evaluation Pitfalls

Published: December 18, 2025 at 02:43 AM EST
4 min read
Source: arXiv - 2512.16272v1

Overview

Large Language Models are increasingly used as automated judges (LLM‑as‑a‑Judge, or LaaJ) to evaluate code generated by AI systems. This paper investigates a real‑world scenario—modernizing legacy COBOL applications—and shows that even production‑grade LaaJs miss many critical bugs. By coupling the LLM judges with a lightweight static‑analysis “hint” engine, the authors dramatically improve both error detection and explanation quality.

Key Contributions

  • Empirical audit of LaaJ on COBOL modernization: Demonstrates that four production‑level LLM judges catch only ~45 % of real defects in generated code.
  • Domain‑specific taxonomy of blind spots: Catalogues >30 recurring COBOL‑related issues that LaaJs routinely overlook (e.g., incorrect data‑type sizing, misplaced PERFORM statements, legacy API misuse).
  • Analytic hint generator: A lightweight static‑analysis tool that flags the taxonomy’s issues and produces concise, machine‑readable hints.
  • Hybrid evaluation pipeline (LaaJ + Hints): Shows that injecting these hints into the LLM’s prompt boosts detection coverage up to 94 % for the best judge, while also yielding richer explanations.
  • Open resources: Releases the annotated dataset, taxonomy, prompts, and the hint‑generation code for reproducibility and community extension.

Methodology

  1. Data collection – The team gathered 100 COBOL programs generated by an internal code‑generation model, each paired with a ground‑truth defect list created by senior COBOL engineers.
  2. Baseline evaluation – Four production LaaJs (GPT‑4, Claude, Llama‑2‑Chat, and a proprietary model) were prompted to assess each program and produce error reports.
  3. Blind‑spot analysis – Researchers compared LaaJ outputs against the expert defect list, extracting recurring missed patterns and grouping them into a taxonomy.
  4. Hint engine development – A rule‑based static analyzer (≈200 lines of Python) scans a COBOL file, matches it against the taxonomy, and emits short “hint” statements (e.g., “Check that PIC 9(5) fields are not truncated”); a minimal sketch of such a rule engine follows this list.
  5. Hybrid prompting – The original LaaJ prompt is augmented with the generated hints, asking the model to “re‑evaluate with these considerations in mind.”
  6. Metrics – Coverage (percentage of true defects detected) and explanation quality (human‑rated relevance and completeness) are measured for LaaJ alone, Hints alone, and LaaJ + Hints.
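
The paper’s actual hint engine is not reproduced here; the following is a minimal, illustrative sketch of how a rule‑based hint generator of this kind could be written. The regex rules, the hint wording beyond the quoted PIC 9(5) example, and the function name generate_hints are assumptions for illustration, not the released code.

```python
import re

# Hypothetical rule table: each entry pairs a regex over COBOL source with
# the hint emitted when the pattern matches. Only the PIC 9(5) hint echoes
# the paper's quoted example; the other rule is an illustrative placeholder.
HINT_RULES = [
    (re.compile(r"PIC\s+9\(5\)", re.IGNORECASE),
     "Check that PIC 9(5) fields are not truncated."),
    (re.compile(r"\bPERFORM\b(?!.*\bTHRU\b)", re.IGNORECASE),
     "Verify that PERFORM targets and fall-through paragraphs are correct."),
]

def generate_hints(cobol_source: str) -> list[str]:
    """Scan a COBOL program and return short, machine-readable hints."""
    return [hint for pattern, hint in HINT_RULES if pattern.search(cobol_source)]

if __name__ == "__main__":
    sample = """
        01  WS-AMOUNT      PIC 9(5).
        PROCEDURE DIVISION.
            PERFORM COMPUTE-TOTAL.
    """
    for h in generate_hints(sample):
        print("HINT:", h)
```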

Results & Findings

| Configuration | Defect Coverage | Explanation Quality* |
|---|---|---|
| LaaJ only (average) | 45 % | Moderate (often generic) |
| Analytic Hints only | 28 % (no deep reasoning) | Low (no narrative) |
| LaaJ + Hints (best judge + tailored prompt) | 94 % | High (specific, actionable) |

*Explanation quality was rated by the same COBOL experts on a 1‑5 Likert scale; the hybrid approach consistently scored 4.2 vs. 2.7 for LaaJ alone.
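
For clarity, the coverage figures above are simply the fraction of expert‑annotated defects that appear in a judge’s report. A minimal sketch of that computation follows; how defects are matched to report entries is not specified here, so matching by defect identifier is an assumption.

```python
def defect_coverage(reported: set[str], ground_truth: set[str]) -> float:
    """Fraction of expert-annotated defects present in the judge's report."""
    if not ground_truth:
        return 1.0
    return len(reported & ground_truth) / len(ground_truth)

# Example: a judge that flags 9 of 20 annotated defects has 45% coverage.
truth = {f"D{i}" for i in range(20)}
found = {f"D{i}" for i in range(9)}
print(f"coverage = {defect_coverage(found, truth):.0%}")  # coverage = 45%
```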

Key observations

  • The hint injection does not require fine‑tuning the LLM; a simple prompt rewrite suffices (a sketch of such an augmented prompt follows these observations).
  • Different judges benefit to varying degrees; the most capable model (GPT‑4) showed the largest jump, but even smaller models improved dramatically.
  • The static analyzer alone cannot explain why an issue matters, but it reliably surfaces the “what” for the LLM to elaborate on.
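
As a concrete reading of the first observation, here is a minimal sketch of the kind of prompt rewrite involved. The instruction wording echoes the paper’s “re‑evaluate with these considerations in mind” augmentation, while the helper name build_hybrid_prompt and the surrounding template are assumptions for illustration.

```python
def build_hybrid_prompt(base_prompt: str, cobol_source: str, hints: list[str]) -> str:
    """Append analytic hints to the baseline LaaJ evaluation prompt."""
    hint_block = "\n".join(f"- {h}" for h in hints)
    return (
        f"{base_prompt}\n\n"
        f"Program under review:\n{cobol_source}\n\n"
        f"Re-evaluate the program with these considerations in mind:\n{hint_block}\n"
    )
```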

Practical Implications

  • Safer AI‑assisted code generation pipelines: Adding a cheap static‑analysis pre‑check can turn a flaky LLM evaluator into a near‑oracle for domain‑specific bugs.
  • Low‑overhead integration: The hint generator runs in milliseconds and can be slotted into CI/CD pipelines before the LLM judge is invoked (see the pipeline sketch after this list).
  • Generalizable pattern: The same “analysis‑then‑prompt” recipe can be applied to other legacy languages (e.g., PL/SQL, Fortran) or even modern stacks where LLMs lack deep domain knowledge.
  • Reduced reliance on human review: With 94 % coverage, teams can confidently automate large portions of legacy migration QA, freeing senior engineers for higher‑level design work.
  • Prompt engineering insight: Demonstrates that dynamic, data‑driven prompt augmentation is more effective than static “few‑shot” examples for mitigating blind spots.
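
One way such a pre‑check could sit in front of the judge is sketched below. It builds on the hypothetical generate_hints and build_hybrid_prompt helpers sketched earlier, and call_llm_judge is a placeholder for whatever model API a team actually uses.

```python
from typing import Callable

def evaluate_generated_cobol(
    cobol_source: str,
    base_prompt: str,
    call_llm_judge: Callable[[str], str],
) -> str:
    """Run the cheap static pre-check, then invoke the LLM judge with hints."""
    hints = generate_hints(cobol_source)          # rule-based, runs in milliseconds
    prompt = build_hybrid_prompt(base_prompt, cobol_source, hints)
    return call_llm_judge(prompt)                 # judge elaborates on the "why"
```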

Limitations & Future Work

  • Domain scope: The study focuses exclusively on COBOL; the taxonomy and hint rules may not transfer directly to other languages without adaptation.
  • Static analysis depth: The current hint engine is rule‑based and may miss subtle semantic bugs that require full program analysis or runtime profiling.
  • Scalability of taxonomy creation: Building the blind‑spot taxonomy required expert annotation; automating this step remains an open challenge.
  • Evaluation breadth: Only 100 programs were tested; larger, more diverse corpora could reveal additional edge cases.
  • Future directions: Extending the hybrid framework to incorporate dynamic testing feedback, exploring automated taxonomy induction, and evaluating the approach on non‑code tasks such as documentation generation or model‑generated design specs.

Authors

  • Ora Nova Fandina
  • Eitan Farchi
  • Shmulik Froimovich
  • Raviv Gal
  • Wesam Ibraheem
  • Rami Katan
  • Alice Podolsky

Paper Information

  • arXiv ID: 2512.16272v1
  • Categories: cs.SE, cs.AI
  • Published: December 18, 2025
