[Paper] Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization
Source: arXiv - 2604.20726v1
Overview
The paper investigates how the prompt given to a large language model (LLM) acting as a judge can dramatically affect the reliability of automated evaluation for free‑text legal question answering. By optimizing judge prompts automatically rather than hand‑crafting them, the authors show that evaluations become more reliable and transferable across judges, an insight that matters for anyone building or benchmarking AI‑driven legal tools.
Key Contributions
- Prompt‑optimization pipeline (ProTeGi) applied to legal QA, demonstrating systematic improvement over manually designed prompts.
- Empirical comparison of judge feedback styles (lenient vs. strict) and their impact on prompt quality.
- Cross‑judge transfer experiments, revealing that prompts tuned with lenient judges generalize better to stricter judges than the reverse.
- Open‑source release of code, benchmark data, and the optimized prompts for reproducibility.
Methodology
- Benchmark & Models – The authors use the LEXam legal QA benchmark and evaluate four task models (different LLMs that generate answers).
- Judges – Two LLMs serve as “judges”: Qwen‑3‑32B (lenient feedback) and DeepSeek‑V3 (strict feedback). Each judge scores a task model’s answer as correct or incorrect according to the judge prompt being optimized.
- Prompt Optimization (ProTeGi) –
  1. Start with a baseline judge prompt (the instruction given to the judge).
  2. Generate a pool of candidate prompts by mutating wording, format, and examples.
  3. Run each candidate on a training subset of LEXam, collect the judge’s feedback, and compute a reward (e.g., agreement with a gold label).
  4. Use a simple evolutionary search to keep the highest‑scoring prompts and iterate.
- Evaluation – After optimization, the best prompt is tested on a held‑out validation set. The authors also swap judges to see whether a prompt optimized for one judge works for the other.
The whole process is fully automated; developers only need to supply the benchmark data and choose a judge model.
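The optimization loop described above can be sketched as a toy evolutionary search. Everything here is illustrative rather than taken from the paper's released code: the rule-based `llm_judge` stands in for a real LLM judge call, the four `TRAIN` triples stand in for LEXam, and ProTeGi's LLM-generated prompt edits are replaced by fixed textual tweaks.

```python
# Toy (question, answer, gold_label) triples standing in for a LEXam train split.
TRAIN = [
    ("Q1", "covers the main legal principle", 1),
    ("Q2", "irrelevant text", 0),
    ("Q3", "cites the controlling rule", 1),
    ("Q4", "off-topic filler", 0),
]

def llm_judge(prompt: str, question: str, answer: str) -> int:
    """Rule-based stand-in for an LLM judge: a prompt mentioning the
    main principle triggers a broader criterion; otherwise the judge
    falls back to a narrow surface match."""
    if "principle" in prompt:
        return int("legal" in answer or "rule" in answer)
    return int("cites" in answer)

def reward(prompt: str) -> float:
    """Agreement between judge verdicts and gold labels on the train split."""
    hits = sum(llm_judge(prompt, q, a) == gold for q, a, gold in TRAIN)
    return hits / len(TRAIN)

def mutate(prompt: str) -> list[str]:
    """Cheap fixed tweaks; ProTeGi generates edits with an LLM instead."""
    tweaks = ["Be strict.", "Focus on the main legal principle.",
              "Judge step by step."]
    return [prompt + " " + t for t in tweaks]

def evolve(seed: str, generations: int = 3, keep: int = 2) -> str:
    """Keep the highest-reward prompts each generation and iterate."""
    best = [seed]
    for _ in range(generations):
        cands = best + [m for p in best for m in mutate(p)]
        best = sorted(cands, key=reward, reverse=True)[:keep]
    return best[0]

best_prompt = evolve("Score the answer as correct or incorrect.")
print(round(reward(best_prompt), 2))  # prints 1.0 on this toy data
```

In the real pipeline the reward is computed by running the judge on actual task-model answers, and the final prompt is validated on a held-out split, but the select-mutate-score loop has the same shape.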
Results & Findings
| Scenario | Baseline (human‑crafted prompt) | Prompt tuned with lenient judge | Prompt tuned with strict judge |
|---|---|---|---|
| Same judge, same task model | 68.2 % accuracy | 74.9 % (+6.7) | 71.5 % (+3.3) |
| Cross‑judge transfer (lenient→strict) | – | 73.1 % (retains most of the gain) | – |
| Cross‑judge transfer (strict→lenient) | – | – | 68.9 % (loses most of the gain) |
- Lenient feedback wins: Prompts tuned with the permissive judge consistently gave larger gains and were more stable across runs.
- Better transferability: A lenient‑optimized prompt retained most of its advantage when evaluated by the stricter judge, while the opposite direction suffered a noticeable drop.
- Why? Analysis of the generated prompts shows lenient judges encourage broader criteria (e.g., “covers the main legal principle”), whereas strict judges push for narrow, surface‑level matches, leading to over‑fitting to that judge’s idiosyncrasies.
Practical Implications
- Automated prompt tuning can replace manual prompt engineering for legal‑QA evaluation pipelines, saving time for dev teams.
- Choosing a lenient judge during optimization yields more robust evaluation scripts that can be reused with stricter judges later, simplifying multi‑judge benchmarking setups.
- Open‑source prompts can be dropped into existing pipelines (e.g., LangChain, LlamaIndex) to improve the reliability of automated legal answer grading without retraining the underlying LLM.
- Generalizable lesson: In any domain where an LLM is used as a “judge” (code review, content moderation, fact‑checking), start with a permissive feedback style for prompt search to avoid over‑fitting to a single evaluator.
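As a sketch of how an optimized judge prompt could be dropped into an existing grading pipeline: the template and helper names below are hypothetical (not the paper's released prompts), and `call_llm` is a stub to be replaced by a real chat-completion client.

```python
# Hypothetical judge-prompt template; an optimized prompt from the paper's
# release would slot in here unchanged.
JUDGE_PROMPT = (
    "You are grading a legal QA answer. Mark it CORRECT if it covers the "
    "main legal principle of the reference, even if the wording differs.\n"
    "Question: {question}\nReference: {reference}\nAnswer: {answer}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)

def call_llm(prompt: str) -> str:
    """Stub so the sketch runs offline; swap in a real LLM API call.
    The toy rule only inspects the text after the 'Answer:' field."""
    answer_part = prompt.split("Answer:")[1]
    return "CORRECT" if "reasonableness" in answer_part else "INCORRECT"

def grade(question: str, reference: str, answer: str) -> bool:
    """Fill the judge prompt and parse the one-word verdict."""
    filled = JUDGE_PROMPT.format(question=question, reference=reference,
                                 answer=answer)
    return call_llm(filled).strip().upper() == "CORRECT"

print(grade("Which standard governs negligence?",
            "The reasonableness standard.",
            "Courts apply a reasonableness test to the conduct."))  # prints True
```

Constraining the judge to a single-word verdict keeps the parsing step trivial; pipelines that need rationales typically ask for a verdict on the final line and parse only that.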
Limitations & Future Work
- The study is confined to one benchmark (LEXam) and four task models; results may differ on other legal corpora or multilingual settings.
- Only two judge LLMs were examined; the spectrum of possible feedback styles (e.g., hybrid or domain‑specialized judges) remains unexplored.
- Prompt optimization used a simple evolutionary search; more sophisticated methods (RL‑based prompt generation, differentiable prompting) could yield further gains.
- Future research could investigate dynamic prompt adaptation—changing the judge prompt on‑the‑fly based on answer difficulty—or extend the framework to multi‑turn legal dialogues.
Authors
- Mohamed Hesham Elganayni
- Runsheng Chen
- Sebastian Nagl
- Matthias Grabmair
Paper Information
- arXiv ID: 2604.20726v1
- Categories: cs.CL, cs.AI
- Published: April 22, 2026