[Paper] Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization
Source: arXiv - 2604.20726v1
Overview
The paper investigates how the prompt given to a large language model (LLM) acting as a judge can dramatically affect the reliability of automated evaluation for free‑text legal question answering. By optimizing judge prompts automatically rather than hand‑crafting them, the authors show that evaluations become more reliable and transferable across judges, an insight that matters for anyone building or benchmarking AI‑driven legal tools.
Key Contributions
- Prompt‑optimization pipeline (ProTeGi) applied to legal QA, demonstrating systematic improvement over manually designed prompts.
- Empirical comparison of judge feedback styles (lenient vs. strict) and their impact on prompt quality.
- Cross‑judge transfer experiments, revealing that prompts tuned with lenient judges generalize better to stricter judges than the reverse.
- Open‑source release of code, benchmark data, and the optimized prompts for reproducibility.
Methodology
- Benchmark & Models – The authors use the LEXam legal QA benchmark and evaluate four task models (different LLMs that generate answers).
- Judges – Two LLMs serve as “judges”: Qwen‑3‑32B (lenient feedback) and DeepSeek‑V3 (strict feedback). Each judge scores a task model’s answer as correct or incorrect according to the judge prompt being optimized.
- Prompt Optimization (ProTeGi) –
  1. Start with a baseline judge prompt (the instruction given to the judge).
  2. Generate a pool of candidate prompts by mutating wording, format, and examples.
  3. Run each candidate on a training subset of LEXam, collect the judge’s feedback, and compute a reward (e.g., agreement with a gold label).
  4. Use a simple evolutionary search to keep the highest‑scoring prompts and iterate.
- Evaluation – After optimization, the best prompt is tested on a held‑out validation set. The authors also swap judges to see whether a prompt optimized for one judge works for the other.
The whole process is fully automated; developers only need to supply the benchmark data and choose a judge model.
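The optimization loop described above can be sketched as a toy evolutionary search. Everything here is illustrative rather than taken from the paper's released code: the rule-based `llm_judge` stands in for a real LLM judge call, the four `TRAIN` triples stand in for LEXam, and ProTeGi's LLM-generated prompt edits are replaced by fixed textual tweaks.

```python
# Toy (question, answer, gold_label) triples standing in for a LEXam train split.
TRAIN = [
    ("Q1", "covers the main legal principle", 1),
    ("Q2", "irrelevant text", 0),
    ("Q3", "cites the controlling rule", 1),
    ("Q4", "off-topic filler", 0),
]

def llm_judge(prompt: str, question: str, answer: str) -> int:
    """Rule-based stand-in for an LLM judge: a prompt mentioning the
    main principle triggers a broader criterion; otherwise the judge
    falls back to a narrow surface match."""
    if "principle" in prompt:
        return int("legal" in answer or "rule" in answer)
    return int("cites" in answer)

def reward(prompt: str) -> float:
    """Agreement between judge verdicts and gold labels on the train split."""
    hits = sum(llm_judge(prompt, q, a) == gold for q, a, gold in TRAIN)
    return hits / len(TRAIN)

def mutate(prompt: str) -> list[str]:
    """Cheap fixed tweaks; ProTeGi generates edits with an LLM instead."""
    tweaks = ["Be strict.", "Focus on the main legal principle.",
              "Judge step by step."]
    return [prompt + " " + t for t in tweaks]

def evolve(seed: str, generations: int = 3, keep: int = 2) -> str:
    """Keep the highest-reward prompts each generation and iterate."""
    best = [seed]
    for _ in range(generations):
        cands = best + [m for p in best for m in mutate(p)]
        best = sorted(cands, key=reward, reverse=True)[:keep]
    return best[0]

best_prompt = evolve("Score the answer as correct or incorrect.")
print(round(reward(best_prompt), 2))  # prints 1.0 on this toy data
```

In the real pipeline the reward is computed by running the judge on actual task-model answers, and the final prompt is validated on a held-out split, but the select-mutate-score loop has the same shape.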
Results & Findings
| Scenario | Baseline (human‑crafted prompt) | Prompt tuned with lenient judge | Prompt tuned with strict judge |
|---|---|---|---|
| Same judge, same task model | 68.2 % accuracy | 74.9 % (+6.7) | 71.5 % (+3.3) |
| Cross‑judge transfer (lenient→strict) | – | 73.1 % (retains most of the gain) | – |
| Cross‑judge transfer (strict→lenient) | – | – | 68.9 % (loses most of the gain) |
- Lenient feedback wins: Prompts tuned with the permissive judge consistently gave larger gains and were more stable across runs.
- Better transferability: A lenient‑optimized prompt retained most of its advantage when evaluated by the stricter judge, while the opposite direction suffered a noticeable drop.
- Why? Analysis of the generated prompts shows lenient judges encourage broader criteria (e.g., “covers the main legal principle”), whereas strict judges push for narrow, surface‑level matches, leading to over‑fitting to that judge’s idiosyncrasies.
Practical Implications
- Automated prompt tuning can replace manual prompt engineering for legal‑QA evaluation pipelines, saving time for dev teams.
- Choosing a lenient judge during optimization yields more robust evaluation scripts that can be reused with stricter judges later, simplifying multi‑judge benchmarking setups.
- Open‑source prompts can be dropped into existing pipelines (e.g., LangChain, LlamaIndex) to improve the reliability of automated legal answer grading without retraining the underlying LLM.
- Generalizable lesson: In any domain where an LLM is used as a “judge” (code review, content moderation, fact‑checking), start with a permissive feedback style for prompt search to avoid over‑fitting to a single evaluator.
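As a sketch of how an optimized judge prompt could be dropped into an existing grading pipeline: the template and helper names below are hypothetical (not the paper's released prompts), and `call_llm` is a stub to be replaced by a real chat-completion client.

```python
# Hypothetical judge-prompt template; an optimized prompt from the paper's
# release would slot in here unchanged.
JUDGE_PROMPT = (
    "You are grading a legal QA answer. Mark it CORRECT if it covers the "
    "main legal principle of the reference, even if the wording differs.\n"
    "Question: {question}\nReference: {reference}\nAnswer: {answer}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)

def call_llm(prompt: str) -> str:
    """Stub so the sketch runs offline; swap in a real LLM API call.
    The toy rule only inspects the text after the 'Answer:' field."""
    answer_part = prompt.split("Answer:")[1]
    return "CORRECT" if "reasonableness" in answer_part else "INCORRECT"

def grade(question: str, reference: str, answer: str) -> bool:
    """Fill the judge prompt and parse the one-word verdict."""
    filled = JUDGE_PROMPT.format(question=question, reference=reference,
                                 answer=answer)
    return call_llm(filled).strip().upper() == "CORRECT"

print(grade("Which standard governs negligence?",
            "The reasonableness standard.",
            "Courts apply a reasonableness test to the conduct."))  # prints True
```

Constraining the judge to a single-word verdict keeps the parsing step trivial; pipelines that need rationales typically ask for a verdict on the final line and parse only that.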
Limitations & Future Work
- The study is confined to one benchmark (LEXam) and four task models; results may differ on other legal corpora or multilingual settings.
- Only two judge LLMs were examined; the spectrum of possible feedback styles (e.g., hybrid or domain‑specialized judges) remains unexplored.
- Prompt optimization used a simple evolutionary search; more sophisticated methods (RL‑based prompt generation, differentiable prompting) could yield further gains.
- Future research could investigate dynamic prompt adaptation—changing the judge prompt on‑the‑fly based on answer difficulty—or extend the framework to multi‑turn legal dialogues.
Authors
- Mohamed Hesham Elganayni
- Runsheng Chen
- Sebastian Nagl
- Matthias Grabmair
Paper Information
- arXiv ID: 2604.20726v1
- Categories: cs.CL, cs.AI
- Published: April 22, 2026