[Paper] One-Eval: An Agentic System for Automated and Traceable LLM Evaluation

Published: March 10, 2026 at 11:45 AM EDT
4 min read
Source: arXiv - 2603.09821v1

Overview

The paper introduces One‑Eval, a new “agentic” system that turns natural‑language evaluation requests into fully automated, reproducible, and auditable workflows for Large Language Models (LLMs). By handling everything from benchmark selection to metric reporting, One‑Eval aims to cut the manual overhead that currently plagues LLM testing in both research labs and production environments.

Key Contributions

  • NL2Bench: Parses a user’s free‑form evaluation intent into a structured plan, recommending the most relevant benchmarks for the task.
  • BenchResolve: Automatically discovers, downloads, and normalizes datasets, handling schema mismatches so the evaluation can run out‑of‑the‑box.
  • Metrics & Reporting: Chooses task‑aware metrics, generates detailed, decision‑oriented reports (beyond a single accuracy number), and preserves evidence for each sample.
  • Human‑in‑the‑Loop Guardrails: Provides checkpoints for review, editing, and rollback, ensuring developers stay in control while still benefiting from automation.
  • Traceability & Auditing: Every step leaves an immutable evidence trail, making debugging and compliance checks straightforward.
  • Open‑Source Release: The full framework is available on GitHub, encouraging community extensions and industrial adoption.
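To make the BenchResolve contribution concrete, here is a minimal sketch of the kind of schema reconciliation it describes: mapping benchmark columns with differing names onto one unified shape. The unified field names and the alias table are my illustrative assumptions, not One‑Eval's actual interface.

```python
# Sketch of the schema reconciliation a BenchResolve-style step performs.
# The unified field names and the alias table below are illustrative
# assumptions, not One-Eval's actual interface.
ALIASES = {
    "prompt": {"prompt", "question", "input", "text"},
    "reference": {"reference", "answer", "canonical_solution", "target"},
    "task_id": {"task_id", "id", "idx"},
}

def normalize_record(record: dict) -> dict:
    """Map one raw benchmark row onto a unified {prompt, reference, task_id} schema."""
    unified_row = {}
    for unified_name, variants in ALIASES.items():
        for raw_key, value in record.items():
            if raw_key.lower() in variants:
                unified_row[unified_name] = value
                break
    return unified_row

# Rows from benchmarks with different column names map to the same shape:
row = normalize_record({"Question": "2+2?", "Answer": "4", "id": 7})
```

Once every dataset is coerced into the same record shape, downstream metric code can stay benchmark-agnostic, which is the "out‑of‑the‑box" property the paper claims.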

Methodology

  1. Intent Capture – A developer writes a plain‑English request (e.g., “Compare GPT‑4 and Claude on code generation for Python”). One‑Eval’s NL2Bench component uses an LLM to extract the core intent, required tasks, and any constraints.
  2. Benchmark Planning – Based on the parsed intent, the system suggests a set of existing benchmarks (e.g., HumanEval, MBPP) and lets the user tweak the list.
  3. Benchmark Resolution – BenchResolve automatically pulls the datasets, reconciles differing column names or formats, and creates a unified schema that downstream scripts can consume without manual wrangling.
  4. Metric Selection & Execution – The framework matches each benchmark to the most appropriate evaluation metric (e.g., pass@k for code, BLEU for translation) and runs the LLMs against the data.
  5. Reporting & Traceability – Results are aggregated into a rich report that includes per‑sample evidence, confidence intervals, and visual diagnostics. All actions are logged, enabling rollback or audit at any point.
  6. Human Review – Before finalizing, a reviewer can inspect the generated plan, edit benchmark choices, or abort execution, ensuring that automation does not become a black box.
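Step 4 names pass@k as the task-aware metric for code benchmarks. For reference, the standard unbiased pass@k estimator (introduced alongside HumanEval) can be computed as follows; this is the well-known formula, not code from the One‑Eval release:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn without replacement from n generations, of which c are
    correct, passes the tests. Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failing generations to fill a sample of size k
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 2 generations of which c = 1 is correct, pass@1 is 0.5, matching the intuition that a single random draw succeeds half the time.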

Results & Findings

  • End‑to‑End Automation: In experiments covering 12 diverse natural‑language requests, One‑Eval completed the full evaluation pipeline with <5 minutes of user interaction, compared to hours of manual setup.
  • Reproducibility Gains: The generated evidence trails allowed the authors to reproduce every reported number across different machines and OS configurations, demonstrating robustness to environment variance.
  • Metric Appropriateness: By automatically selecting task‑aware metrics, the system avoided common pitfalls (e.g., using BLEU for code generation) and produced more meaningful performance insights.
  • Developer Satisfaction: A small user study with 8 engineers reported a 70% reduction in perceived evaluation effort and higher confidence in the resulting reports.

Practical Implications

  • Faster Model Iteration: Teams can spin up evaluations for new model checkpoints in minutes, accelerating the feedback loop in product development.
  • Standardized Benchmarks: By normalizing datasets and metrics, One‑Eval reduces “benchmark drift” where different teams inadvertently compare apples to oranges.
  • Compliance & Auditing: The immutable evidence logs satisfy internal governance and external regulatory requirements (e.g., for AI model documentation).
  • Lower Barrier to Entry: Start‑ups and smaller teams without dedicated evaluation engineers can still run sophisticated, multi‑benchmark tests using plain English prompts.
  • Integration Friendly: The open‑source package can be plugged into CI/CD pipelines, enabling automated regression testing for LLM updates.
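For the CI/CD use case above, the glue code could be as simple as a regression gate that compares a new checkpoint's scores against a stored baseline. This is a generic sketch under my own assumptions (metric names, report shape, tolerance); it is not part of One‑Eval's released interface.

```python
# Hypothetical CI regression gate: flag a new checkpoint whose score drops
# more than `tolerance` below the stored baseline, or whose metric is missing.
# The metric names and report shape are assumptions for illustration.
def regression_gate(baseline: dict, candidate: dict, tolerance: float = 0.02) -> list:
    failures = []
    for metric, base_score in baseline.items():
        new_score = candidate.get(metric)
        if new_score is None or base_score - new_score > tolerance:
            failures.append(f"{metric}: baseline={base_score}, candidate={new_score}")
    return failures

# A 4-point drop on pass@1 exceeds the 2-point tolerance and is flagged:
issues = regression_gate({"pass@1": 0.62, "pass@10": 0.81},
                         {"pass@1": 0.58, "pass@10": 0.82})
```

In a pipeline, a non-empty failure list would fail the build, turning each evaluation run into an automated regression test for LLM updates.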

Limitations & Future Work

  • Benchmark Coverage: The current library focuses on well‑known public benchmarks; niche or proprietary datasets still need manual integration.
  • LLM Dependency: NL2Bench and parts of the pipeline rely on an underlying LLM for parsing intents, which can introduce errors for ambiguous or highly technical requests.
  • Scalability: While the system handles moderate dataset sizes efficiently, massive corpora (e.g., multi‑GB web‑scale tests) may require additional engineering for distributed execution.
  • User Trust: Although human‑in‑the‑loop checkpoints exist, developers may need more transparent explanations of why certain metrics or benchmarks were chosen.

Future research directions include expanding the benchmark repository, improving intent‑parsing robustness with few‑shot prompting, and adding native support for distributed evaluation workloads.

Authors

  • Chengyu Shen
  • Yanheng Hou
  • Minghui Pan
  • Runming He
  • Zhen Hao Wong
  • Meiyi Qiang
  • Zhou Liu
  • Hao Liang
  • Peichao Lai
  • Zeang Sheng
  • Wentao Zhang

Paper Information

  • arXiv ID: 2603.09821v1
  • Categories: cs.CL
  • Published: March 10, 2026