[Paper] Beyond Blind Spots: Analytic Hints for Mitigating LLM-Based Evaluation Pitfalls
Source: arXiv - 2512.16272v1
Overview
Large Language Models are increasingly used as automated judges (LLM‑as‑a‑Judge, or LaaJ) to evaluate code generated by AI systems. This paper investigates a real‑world scenario, the modernization of legacy COBOL applications, and shows that even production‑grade LaaJs miss many critical bugs. By coupling the LLM judges with a lightweight static‑analysis “hint” engine, the authors dramatically improve error detection and explanation quality.
Key Contributions
- Empirical audit of LaaJ on COBOL modernization: Demonstrates that four production‑level LLM judges catch only ~45 % of real defects in generated code.
- Domain‑specific taxonomy of blind spots: Catalogues >30 recurring COBOL‑related issues that LaaJs routinely overlook (e.g., incorrect data‑type sizing, misplaced PERFORM statements, legacy API misuse).
- Analytic hint generator: A lightweight static‑analysis tool that flags the taxonomy’s issues and produces concise, machine‑readable hints (a hypothetical hint record is sketched after this list).
- Hybrid evaluation pipeline (LaaJ + Hints): Shows that injecting these hints into the LLM’s prompt boosts detection coverage up to 94 % for the best judge, while also yielding richer explanations.
- Open resources: Releases the annotated dataset, taxonomy, prompts, and the hint‑generation code for reproducibility and community extension.
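The summary describes the hints as concise and machine‑readable but does not fix a schema. A minimal sketch of what one hint record could look like is given below; the field names (rule_id, line, message, severity) are assumptions for illustration, not the authors' format.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class Hint:
    """One machine-readable hint from the analytic hint generator (hypothetical schema)."""
    rule_id: str   # identifier of the taxonomy rule that fired
    line: int      # 1-based line number in the COBOL source
    message: str   # short, LLM-consumable hint text
    severity: str  # e.g. "warning" or "error"

# Example hint for the truncation issue quoted in this summary.
hint = Hint(
    rule_id="PIC-TRUNCATION",
    line=42,
    message="Check that PIC 9(5) fields are not truncated when assigned from larger numeric fields.",
    severity="warning",
)

# Hints can be serialized to JSON before being spliced into the judge prompt.
print(json.dumps(asdict(hint), indent=2))
```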
Methodology
- Data collection – The team gathered 100 COBOL programs generated by an internal code‑generation model, each paired with a ground‑truth defect list created by senior COBOL engineers.
- Baseline evaluation – Four production LaaJs (GPT‑4, Claude, Llama‑2‑Chat, and a proprietary model) were prompted to assess each program and produce error reports.
- Blind‑spot analysis – Researchers compared LaaJ outputs against the expert defect list, extracting recurring missed patterns and grouping them into a taxonomy.
- Hint engine development – A rule‑based static analyzer (≈200 lines of Python) scans a COBOL file, matches it against the taxonomy, and emits short “hint” statements (e.g., “Check that PIC 9(5) fields are not truncated”); a minimal sketch of such an engine appears after this list.
- Hybrid prompting – The original LaaJ prompt is augmented with the generated hints, asking the model to “re‑evaluate with these considerations in mind” (see the prompt‑assembly sketch below).
- Metrics – Coverage (percentage of true defects detected) and explanation quality (human‑rated relevance and completeness) are measured for LaaJ alone, Hints alone, and LaaJ + Hints (toy computations of these metrics appear after this list).
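To make the rule‑based analyzer step concrete, here is a minimal sketch of a regex‑driven hint engine over COBOL source, assuming the taxonomy is encoded as (rule id, pattern, message) triples. The two rules shown are illustrative assumptions; the authors' ~200‑line tool may be structured quite differently.

```python
import re

# Illustrative taxonomy rules: (rule_id, compiled pattern, hint text).
# Real rules would come from the paper's blind-spot taxonomy.
RULES = [
    ("PIC-TRUNCATION",
     re.compile(r"\bPIC\s+9\(5\)", re.IGNORECASE),
     "Check that PIC 9(5) fields are not truncated."),
    ("PERFORM-PLACEMENT",
     re.compile(r"\bPERFORM\b(?!\s+\w)", re.IGNORECASE),
     "Verify that this PERFORM statement names a valid paragraph or section."),
]

def generate_hints(cobol_source: str) -> list[dict]:
    """Scan COBOL source line by line and emit a hint for every rule that matches."""
    hints = []
    for lineno, line in enumerate(cobol_source.splitlines(), start=1):
        for rule_id, pattern, message in RULES:
            if pattern.search(line):
                hints.append({"rule_id": rule_id, "line": lineno, "message": message})
    return hints

if __name__ == "__main__":
    sample = """       01  WS-AMOUNT   PIC 9(5).
       PROCEDURE DIVISION.
           PERFORM
           STOP RUN."""
    for h in generate_hints(sample):
        print(f"[{h['rule_id']}] line {h['line']}: {h['message']}")
```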
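The hybrid‑prompting step amounts to text concatenation: rendering the hints as bullet points and appending them to the original judge prompt. The sketch below assumes that structure; the actual prompt wording used in the paper is not reproduced here.

```python
def build_hybrid_prompt(base_prompt: str, cobol_source: str, hints: list[dict]) -> str:
    """Append analytic hints to the original LaaJ prompt (wording here is illustrative)."""
    prompt = f"{base_prompt}\n\nCOBOL PROGRAM:\n{cobol_source}\n"
    if hints:
        hint_block = "\n".join(f"- (line {h['line']}) {h['message']}" for h in hints)
        prompt += (
            "\nRe-evaluate the program with these considerations in mind:\n"
            f"{hint_block}\n"
        )
    return prompt

# Example usage, combining this with generate_hints() from the hint-engine sketch above:
# judge_prompt = build_hybrid_prompt(
#     "You are a COBOL expert. List every defect in the program below.",
#     cobol_source,
#     generate_hints(cobol_source),
# )
```

Because the augmented prompt is the only change, no fine‑tuning of the judge is needed, which is consistent with the key observation reported below.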
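Both reported metrics are simple aggregates. The toy computation below assumes detected defects are matched to the expert ground‑truth list by identifier; the numbers are made up and are not the paper's results.

```python
def defect_coverage(detected_ids: set[str], ground_truth_ids: set[str]) -> float:
    """Coverage = fraction of expert-annotated defects that the evaluator reported."""
    if not ground_truth_ids:
        return 0.0
    return len(detected_ids & ground_truth_ids) / len(ground_truth_ids)

def mean_likert(ratings: list[int]) -> float:
    """Explanation quality = mean of the 1-5 expert Likert ratings."""
    return sum(ratings) / len(ratings)

# Toy numbers only, not the paper's data.
found = {"D1", "D3", "D7"}
truth = {f"D{i}" for i in range(1, 11)}
print(f"coverage = {defect_coverage(found, truth):.0%}")   # coverage = 30%
print(f"quality  = {mean_likert([4, 5, 4, 3]):.1f}")       # quality  = 4.0
```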
Results & Findings
| Configuration | Defect Coverage | Explanation Quality* |
|---|---|---|
| LaaJ only (average) | 45 % | Moderate (often generic) |
| Analytic Hints only | 28 % (no deep reasoning) | Low (no narrative) |
| LaaJ + Hints (best judge + tailored prompt) | 94 % | High (specific, actionable) |
*Explanation quality was rated by the same COBOL experts on a 1‑5 Likert scale; the hybrid approach scored 4.2 versus 2.7 for LaaJ alone.
Key Observations
- The hint injection does not require fine‑tuning the LLM; a simple prompt rewrite suffices.
- Different judges benefit to varying degrees; the most capable model (GPT‑4) showed the largest jump, but even smaller models improved dramatically.
- The static analyzer alone cannot explain why an issue matters, but it reliably surfaces the “what” for the LLM to elaborate on.
Practical Implications
- Safer AI‑assisted code generation pipelines: Adding a cheap static‑analysis pre‑check can turn a flaky LLM evaluator into a near‑oracle for domain‑specific bugs.
- Low‑overhead integration: The hint generator runs in milliseconds and can be slotted into CI/CD pipelines before the LLM judge is invoked (a hypothetical pre‑check wrapper is sketched after this list).
- Generalizable pattern: The same “analysis‑then‑prompt” recipe can be applied to other legacy languages (e.g., PL/SQL, Fortran) or even modern stacks where LLMs lack deep domain knowledge.
- Reduced reliance on human review: With 94 % coverage, teams can confidently automate large portions of legacy migration QA, freeing senior engineers for higher‑level design work.
- Prompt engineering insight: Demonstrates that dynamic, data‑driven prompt augmentation is more effective than static “few‑shot” examples for mitigating blind spots.
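To illustrate the CI/CD point above, a thin command‑line wrapper could run the hint generator over the COBOL files touched by a change and flag the pre‑check step whenever hints fire, leaving the LLM judge call to a later stage. The wrapper below is hypothetical (including the hint_engine module name and the exit‑code convention); it is not tooling released with the paper.

```python
import sys
from pathlib import Path

# Hypothetical module: the hint-engine sketch above, saved as hint_engine.py.
from hint_engine import generate_hints

def precheck(paths: list[str]) -> int:
    """Print hints for each COBOL file; return a non-zero exit code if any rule fired,
    so a CI job can surface the hints before the LLM judge step runs."""
    fired = 0
    for path in paths:
        source = Path(path).read_text(encoding="utf-8", errors="replace")
        for hint in generate_hints(source):
            fired += 1
            print(f"{path}:{hint['line']}: [{hint['rule_id']}] {hint['message']}")
    return 1 if fired else 0

if __name__ == "__main__":
    sys.exit(precheck(sys.argv[1:]))
```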
Limitations & Future Work
- Domain scope: The study focuses exclusively on COBOL; the taxonomy and hint rules may not transfer directly to other languages without adaptation.
- Static analysis depth: The current hint engine is rule‑based and may miss subtle semantic bugs that require full program analysis or runtime profiling.
- Scalability of taxonomy creation: Building the blind‑spot taxonomy required expert annotation; automating this step remains an open challenge.
- Evaluation breadth: Only 100 programs were tested; larger, more diverse corpora could reveal additional edge cases.
- Future directions: Extending the hybrid framework to incorporate dynamic testing feedback, exploring automated taxonomy induction, and evaluating the approach on non‑code tasks such as documentation generation or model‑generated design specs.
Authors
- Ora Nova Fandina
- Eitan Farchi
- Shmulik Froimovich
- Raviv Gal
- Wesam Ibraheem
- Rami Katan
- Alice Podolsky
Paper Information
- arXiv ID: 2512.16272v1
- Categories: cs.SE, cs.AI
- Published: December 18, 2025