[Paper] Beyond Blind Spots: Analytic Hints for Mitigating LLM-Based Evaluation Pitfalls
Source: arXiv - 2512.16272v1
Overview
Large Language Models are increasingly used as automated judges (LLM‑as‑a‑Judge, or LaaJ) to evaluate code generated by AI systems. This paper investigates a real‑world scenario, the modernization of legacy COBOL applications, and shows that even production‑grade LaaJs miss many critical bugs. By coupling the LLM judges with a lightweight static‑analysis “hint” engine, the authors dramatically improve error detection and explanation quality.
Key Contributions
- Empirical audit of LaaJ on COBOL modernization: Demonstrates that four production‑level LLM judges catch only ~45 % of real defects in generated code.
- Domain‑specific taxonomy of blind spots: Catalogues >30 recurring COBOL‑related issues that LaaJs routinely overlook (e.g., incorrect data‑type sizing, misplaced PERFORM statements, legacy API misuse).
- Analytic hint generator: A lightweight static‑analysis tool that flags the taxonomy’s issues and produces concise, machine‑readable hints (a hypothetical hint record is sketched after this list).
- Hybrid evaluation pipeline (LaaJ + Hints): Shows that injecting these hints into the LLM’s prompt boosts detection coverage up to 94 % for the best judge, while also yielding richer explanations.
- Open resources: Releases the annotated dataset, taxonomy, prompts, and the hint‑generation code for reproducibility and community extension.
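The summary describes the hints as concise and machine‑readable but does not fix a schema. A minimal sketch of what one hint record could look like is given below; the field names (rule_id, line, message, severity) are assumptions for illustration, not the authors' format.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class Hint:
    """One machine-readable hint from the analytic hint generator (hypothetical schema)."""
    rule_id: str   # identifier of the taxonomy rule that fired
    line: int      # 1-based line number in the COBOL source
    message: str   # short, LLM-consumable hint text
    severity: str  # e.g. "warning" or "error"

# Example hint for the truncation issue quoted in this summary.
hint = Hint(
    rule_id="PIC-TRUNCATION",
    line=42,
    message="Check that PIC 9(5) fields are not truncated when assigned from larger numeric fields.",
    severity="warning",
)

# Hints can be serialized to JSON before being spliced into the judge prompt.
print(json.dumps(asdict(hint), indent=2))
```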
Methodology
- Data collection – The team gathered 100 COBOL programs generated by an internal code‑generation model, each paired with a ground‑truth defect list created by senior COBOL engineers.
- Baseline evaluation – Four production LaaJs (GPT‑4, Claude, Llama‑2‑Chat, and a proprietary model) were prompted to assess each program and produce error reports.
- Blind‑spot analysis – Researchers compared LaaJ outputs against the expert defect list, extracting recurring missed patterns and grouping them into a taxonomy.
- Hint engine development – A rule‑based static analyzer (≈200 lines of Python) scans a COBOL file, matches it against the taxonomy, and emits short “hint” statements (e.g., “Check that PIC 9(5) fields are not truncated”); a minimal sketch of such an engine appears after this list.
- Hybrid prompting – The original LaaJ prompt is augmented with the generated hints, asking the model to “re‑evaluate with these considerations in mind” (see the prompt‑assembly sketch below).
- Metrics – Coverage (percentage of true defects detected) and explanation quality (human‑rated relevance and completeness) are measured for LaaJ alone, Hints alone, and LaaJ + Hints (toy computations of these metrics appear after this list).
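To make the rule‑based analyzer step concrete, here is a minimal sketch of a regex‑driven hint engine over COBOL source, assuming the taxonomy is encoded as (rule id, pattern, message) triples. The two rules shown are illustrative assumptions; the authors' ~200‑line tool may be structured quite differently.

```python
import re

# Illustrative taxonomy rules: (rule_id, compiled pattern, hint text).
# Real rules would come from the paper's blind-spot taxonomy.
RULES = [
    ("PIC-TRUNCATION",
     re.compile(r"\bPIC\s+9\(5\)", re.IGNORECASE),
     "Check that PIC 9(5) fields are not truncated."),
    ("PERFORM-PLACEMENT",
     re.compile(r"\bPERFORM\b(?!\s+\w)", re.IGNORECASE),
     "Verify that this PERFORM statement names a valid paragraph or section."),
]

def generate_hints(cobol_source: str) -> list[dict]:
    """Scan COBOL source line by line and emit a hint for every rule that matches."""
    hints = []
    for lineno, line in enumerate(cobol_source.splitlines(), start=1):
        for rule_id, pattern, message in RULES:
            if pattern.search(line):
                hints.append({"rule_id": rule_id, "line": lineno, "message": message})
    return hints

if __name__ == "__main__":
    sample = """       01  WS-AMOUNT   PIC 9(5).
       PROCEDURE DIVISION.
           PERFORM
           STOP RUN."""
    for h in generate_hints(sample):
        print(f"[{h['rule_id']}] line {h['line']}: {h['message']}")
```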
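The hybrid‑prompting step amounts to text concatenation: rendering the hints as bullet points and appending them to the original judge prompt. The sketch below assumes that structure; the actual prompt wording used in the paper is not reproduced here.

```python
def build_hybrid_prompt(base_prompt: str, cobol_source: str, hints: list[dict]) -> str:
    """Append analytic hints to the original LaaJ prompt (wording here is illustrative)."""
    prompt = f"{base_prompt}\n\nCOBOL PROGRAM:\n{cobol_source}\n"
    if hints:
        hint_block = "\n".join(f"- (line {h['line']}) {h['message']}" for h in hints)
        prompt += (
            "\nRe-evaluate the program with these considerations in mind:\n"
            f"{hint_block}\n"
        )
    return prompt

# Example usage, combining this with generate_hints() from the hint-engine sketch above:
# judge_prompt = build_hybrid_prompt(
#     "You are a COBOL expert. List every defect in the program below.",
#     cobol_source,
#     generate_hints(cobol_source),
# )
```

Because the augmented prompt is the only change, no fine‑tuning of the judge is needed, which is consistent with the key observation reported below.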
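Both reported metrics are simple aggregates. The toy computation below assumes detected defects are matched to the expert ground‑truth list by identifier; the numbers are made up and are not the paper's results.

```python
def defect_coverage(detected_ids: set[str], ground_truth_ids: set[str]) -> float:
    """Coverage = fraction of expert-annotated defects that the evaluator reported."""
    if not ground_truth_ids:
        return 0.0
    return len(detected_ids & ground_truth_ids) / len(ground_truth_ids)

def mean_likert(ratings: list[int]) -> float:
    """Explanation quality = mean of the 1-5 expert Likert ratings."""
    return sum(ratings) / len(ratings)

# Toy numbers only, not the paper's data.
found = {"D1", "D3", "D7"}
truth = {f"D{i}" for i in range(1, 11)}
print(f"coverage = {defect_coverage(found, truth):.0%}")   # coverage = 30%
print(f"quality  = {mean_likert([4, 5, 4, 3]):.1f}")       # quality  = 4.0
```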
Results & Findings
| Configuration | Defect Coverage | Explanation Quality* |
|---|---|---|
| LaaJ only (average) | 45 % | Moderate (often generic) |
| Analytic Hints only | 28 % (no deep reasoning) | Low (no narrative) |
| LaaJ + Hints (best judge + tailored prompt) | 94 % | High (specific, actionable) |
*Explanation quality was rated by the same COBOL experts on a 1‑5 Likert scale; the hybrid approach scored 4.2 versus 2.7 for LaaJ alone.
Key Observations
- The hint injection does not require fine‑tuning the LLM; a simple prompt rewrite suffices.
- Different judges benefit to varying degrees; the most capable model (GPT‑4) showed the largest jump, but even smaller models improved dramatically.
- The static analyzer alone cannot explain why an issue matters, but it reliably surfaces the “what” for the LLM to elaborate on.
Practical Implications
- Safer AI‑assisted code generation pipelines: Adding a cheap static‑analysis pre‑check can turn a flaky LLM evaluator into a near‑oracle for domain‑specific bugs.
- Low‑overhead integration: The hint generator runs in milliseconds and can be slotted into CI/CD pipelines before the LLM judge is invoked (a hypothetical pre‑check wrapper is sketched after this list).
- Generalizable pattern: The same “analysis‑then‑prompt” recipe can be applied to other legacy languages (e.g., PL/SQL, Fortran) or even modern stacks where LLMs lack deep domain knowledge.
- Reduced reliance on human review: With 94 % coverage, teams can confidently automate large portions of legacy migration QA, freeing senior engineers for higher‑level design work.
- Prompt engineering insight: Demonstrates that dynamic, data‑driven prompt augmentation is more effective than static “few‑shot” examples for mitigating blind spots.
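To illustrate the CI/CD point above, a thin command‑line wrapper could run the hint generator over the COBOL files touched by a change and flag the pre‑check step whenever hints fire, leaving the LLM judge call to a later stage. The wrapper below is hypothetical (including the hint_engine module name and the exit‑code convention); it is not tooling released with the paper.

```python
import sys
from pathlib import Path

# Hypothetical module: the hint-engine sketch above, saved as hint_engine.py.
from hint_engine import generate_hints

def precheck(paths: list[str]) -> int:
    """Print hints for each COBOL file; return a non-zero exit code if any rule fired,
    so a CI job can surface the hints before the LLM judge step runs."""
    fired = 0
    for path in paths:
        source = Path(path).read_text(encoding="utf-8", errors="replace")
        for hint in generate_hints(source):
            fired += 1
            print(f"{path}:{hint['line']}: [{hint['rule_id']}] {hint['message']}")
    return 1 if fired else 0

if __name__ == "__main__":
    sys.exit(precheck(sys.argv[1:]))
```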
Limitations & Future Work
- Domain scope: The study focuses exclusively on COBOL; the taxonomy and hint rules may not transfer directly to other languages without adaptation.
- Static analysis depth: The current hint engine is rule‑based and may miss subtle semantic bugs that require full program analysis or runtime profiling.
- Scalability of taxonomy creation: Building the blind‑spot taxonomy required expert annotation; automating this step remains an open challenge.
- Evaluation breadth: Only 100 programs were tested; larger, more diverse corpora could reveal additional edge cases.
- Future directions: Extending the hybrid framework to incorporate dynamic testing feedback, exploring automated taxonomy induction, and evaluating the approach on non‑code tasks such as documentation generation or model‑generated design specs.
Authors
- Ora Nova Fandina
- Eitan Farchi
- Shmulik Froimovich
- Raviv Gal
- Wesam Ibraheem
- Rami Katan
- Alice Podolsky
Paper Information
- arXiv ID: 2512.16272v1
- Categories: cs.SE, cs.AI
- Published: December 18, 2025