[Paper] One-Eval: An Agentic System for Automated and Traceable LLM Evaluation

Published: March 10, 2026 at 11:45 AM EDT
4 min read
Source: arXiv - 2603.09821v1

Overview

The paper introduces One‑Eval, a new “agentic” system that turns natural‑language evaluation requests into fully automated, reproducible, and auditable workflows for Large Language Models (LLMs). By handling everything from benchmark selection to metric reporting, One‑Eval aims to cut the manual overhead that currently plagues LLM testing in both research labs and production environments.

Key Contributions

  • NL2Bench: Parses a user’s free‑form evaluation intent into a structured plan, recommending the most relevant benchmarks for the task.
  • BenchResolve: Automatically discovers, downloads, and normalizes datasets, handling schema mismatches so the evaluation can run out‑of‑the‑box.
  • Metrics & Reporting: Chooses task‑aware metrics, generates detailed, decision‑oriented reports (beyond a single accuracy number), and preserves evidence for each sample.
  • Human‑in‑the‑Loop Guardrails: Provides checkpoints for review, editing, and rollback, ensuring developers stay in control while still benefiting from automation.
  • Traceability & Auditing: Every step leaves an immutable evidence trail, making debugging and compliance checks straightforward.
  • Open‑Source Release: The full framework is available on GitHub, encouraging community extensions and industrial adoption.
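To make the BenchResolve contribution concrete, here is a minimal sketch of the kind of schema reconciliation it describes: mapping benchmark columns with differing names onto one unified shape. The unified field names and the alias table are my illustrative assumptions, not One‑Eval's actual interface.

```python
# Sketch of the schema reconciliation a BenchResolve-style step performs.
# The unified field names and the alias table below are illustrative
# assumptions, not One-Eval's actual interface.
ALIASES = {
    "prompt": {"prompt", "question", "input", "text"},
    "reference": {"reference", "answer", "canonical_solution", "target"},
    "task_id": {"task_id", "id", "idx"},
}

def normalize_record(record: dict) -> dict:
    """Map one raw benchmark row onto a unified {prompt, reference, task_id} schema."""
    unified_row = {}
    for unified_name, variants in ALIASES.items():
        for raw_key, value in record.items():
            if raw_key.lower() in variants:
                unified_row[unified_name] = value
                break
    return unified_row

# Rows from benchmarks with different column names map to the same shape:
row = normalize_record({"Question": "2+2?", "Answer": "4", "id": 7})
```

Once every dataset is coerced into the same record shape, downstream metric code can stay benchmark-agnostic, which is the "out‑of‑the‑box" property the paper claims.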

Methodology

  1. Intent Capture – A developer writes a plain‑English request (e.g., “Compare GPT‑4 and Claude on code generation for Python”). One‑Eval’s NL2Bench component uses an LLM to extract the core intent, required tasks, and any constraints.
  2. Benchmark Planning – Based on the parsed intent, the system suggests a set of existing benchmarks (e.g., HumanEval, MBPP) and lets the user tweak the list.
  3. Benchmark Resolution – BenchResolve automatically pulls the datasets, reconciles differing column names or formats, and creates a unified schema that downstream scripts can consume without manual wrangling.
  4. Metric Selection & Execution – The framework matches each benchmark to the most appropriate evaluation metric (e.g., pass@k for code, BLEU for translation) and runs the LLMs against the data.
  5. Reporting & Traceability – Results are aggregated into a rich report that includes per‑sample evidence, confidence intervals, and visual diagnostics. All actions are logged, enabling rollback or audit at any point.
  6. Human Review – Before finalizing, a reviewer can inspect the generated plan, edit benchmark choices, or abort execution, ensuring that automation does not become a black box.
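Step 4 names pass@k as the task-aware metric for code benchmarks. For reference, the standard unbiased pass@k estimator (introduced alongside HumanEval) can be computed as follows; this is the well-known formula, not code from the One‑Eval release:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn without replacement from n generations, of which c are
    correct, passes the tests. Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failing generations to fill a sample of size k
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 2 generations of which c = 1 is correct, pass@1 is 0.5, matching the intuition that a single random draw succeeds half the time.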

Results & Findings

  • End‑to‑End Automation: In experiments covering 12 diverse natural‑language requests, One‑Eval completed the full evaluation pipeline with <5 minutes of user interaction, compared to hours of manual setup.
  • Reproducibility Gains: The generated evidence trails allowed the authors to reproduce every reported number across different machines and OS configurations, demonstrating robustness to environment variance.
  • Metric Appropriateness: By automatically selecting task‑aware metrics, the system avoided common pitfalls (e.g., using BLEU for code generation) and produced more meaningful performance insights.
  • Developer Satisfaction: A small user study with 8 engineers reported a 70% reduction in perceived evaluation effort and higher confidence in the resulting reports.

Practical Implications

  • Faster Model Iteration: Teams can spin up evaluations for new model checkpoints in minutes, accelerating the feedback loop in product development.
  • Standardized Benchmarks: By normalizing datasets and metrics, One‑Eval reduces “benchmark drift” where different teams inadvertently compare apples to oranges.
  • Compliance & Auditing: The immutable evidence logs satisfy internal governance and external regulatory requirements (e.g., for AI model documentation).
  • Lower Barrier to Entry: Start‑ups and smaller teams without dedicated evaluation engineers can still run sophisticated, multi‑benchmark tests using plain English prompts.
  • Integration Friendly: The open‑source package can be plugged into CI/CD pipelines, enabling automated regression testing for LLM updates.
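For the CI/CD use case above, the glue code could be as simple as a regression gate that compares a new checkpoint's scores against a stored baseline. This is a generic sketch under my own assumptions (metric names, report shape, tolerance); it is not part of One‑Eval's released interface.

```python
# Hypothetical CI regression gate: flag a new checkpoint whose score drops
# more than `tolerance` below the stored baseline, or whose metric is missing.
# The metric names and report shape are assumptions for illustration.
def regression_gate(baseline: dict, candidate: dict, tolerance: float = 0.02) -> list:
    failures = []
    for metric, base_score in baseline.items():
        new_score = candidate.get(metric)
        if new_score is None or base_score - new_score > tolerance:
            failures.append(f"{metric}: baseline={base_score}, candidate={new_score}")
    return failures

# A 4-point drop on pass@1 exceeds the 2-point tolerance and is flagged:
issues = regression_gate({"pass@1": 0.62, "pass@10": 0.81},
                         {"pass@1": 0.58, "pass@10": 0.82})
```

In a pipeline, a non-empty failure list would fail the build, turning each evaluation run into an automated regression test for LLM updates.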

Limitations & Future Work

  • Benchmark Coverage: The current library focuses on well‑known public benchmarks; niche or proprietary datasets still need manual integration.
  • LLM Dependency: NL2Bench and parts of the pipeline rely on an underlying LLM for parsing intents, which can introduce errors for ambiguous or highly technical requests.
  • Scalability: While the system handles moderate dataset sizes efficiently, massive corpora (e.g., multi‑GB web‑scale tests) may require additional engineering for distributed execution.
  • User Trust: Although human‑in‑the‑loop checkpoints exist, developers may need more transparent explanations of why certain metrics or benchmarks were chosen.

Future research directions include expanding the benchmark repository, improving intent‑parsing robustness with few‑shot prompting, and adding native support for distributed evaluation workloads.

Authors

  • Chengyu Shen
  • Yanheng Hou
  • Minghui Pan
  • Runming He
  • Zhen Hao Wong
  • Meiyi Qiang
  • Zhou Liu
  • Hao Liang
  • Peichao Lai
  • Zeang Sheng
  • Wentao Zhang

Paper Information

  • arXiv ID: 2603.09821v1
  • Categories: cs.CL
  • Published: March 10, 2026