[Paper] Artisan: Agentic Artifact Evaluation

Published: February 10, 2026 at 01:15 PM EST
4 min read
Source: arXiv

Overview

Artifact evaluation—checking that a research paper’s code and data actually reproduce the reported results—has become a cornerstone of software‑engineering research. However, the manual effort required limits it to a handful of papers. The new Artisan system shows how a large‑language‑model (LLM) agent can automatically generate reproducible scripts, turning a traditionally labor‑intensive task into a scalable, repeatable service.

Key Contributions

  • Reframing reproduction as code generation – Artisan treats the whole reproducibility problem as “write a script that, when run, yields the paper’s numbers,” enabling the generated script to be inspected, executed, and audited independently of the LLM.
  • Automated judging mechanism – A hidden “oracle” evaluates the script’s output against the expected results without exposing the ground truth, preventing shortcuts like copying pre‑computed tables.
  • Artisan‑Bench benchmark – The first curated suite (60 tasks from 23 SE papers, spanning multiple languages and sub‑domains) for measuring automated artifact‑evaluation capabilities. All tasks were manually verified to be reproducible.
  • Empirical validation – Artisan produced correct reproduction scripts for 44/60 tasks, a 3.14× improvement over a strong baseline LLM agent, while averaging only 0.45 minutes of compute and ~48 minutes of wall‑clock time per task.
  • Error discovery – The system uncovered 20 previously unknown bugs or inconsistencies in the original papers or their artifacts.
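The hidden‑oracle idea can be illustrated with a minimal sketch. All names here are hypothetical stand‑ins (the paper does not publish the judge at this level of detail): the judge holds the expected numbers privately and hands the agent only a pass/fail flag plus a coarse hint.

```python
# Minimal sketch of a hidden-oracle judge. The function name, metric keys,
# and tolerance are illustrative assumptions, not Artisan's actual API.

def judge(script_output: dict, expected: dict, tol: float = 1e-6):
    """Compare a script's reported metrics to hidden ground truth.

    Returns only (passed, hint) -- never the expected values themselves,
    so the agent cannot shortcut by copying them into its script.
    """
    missing = [k for k in expected if k not in script_output]
    if missing:
        return False, "missing metric(s) in output"
    for key, want in expected.items():
        if abs(script_output[key] - want) > tol:
            return False, "numeric mismatch"
    return True, "all metrics match"

# The agent only ever sees the (passed, hint) pair:
passed, hint = judge({"accuracy": 0.912}, {"accuracy": 0.914}, tol=1e-3)
# → (False, "numeric mismatch")
```

Keeping the ground truth on the judge's side of the boundary is what makes the generated script trustworthy: it must actually compute the numbers rather than restate them.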

Methodology

  1. Problem formulation – The authors model artifact evaluation as a script‑generation problem: given a PDF (or its parsed text) and the associated artifact repository, the LLM must output a self‑contained script (e.g., a Bash or Python driver) that reproduces the target figures/tables.
  2. LLM agent design – Artisan builds on a state‑of‑the‑art LLM (e.g., GPT‑4‑style) augmented with a tool‑use loop: the agent can invoke a sandboxed execution environment, inspect logs, and iteratively refine its script.
  3. Judging feedback – After each execution, an automated judge compares the script’s output to the expected numbers (stored in a hidden file). The judge returns only a pass/fail signal plus a high‑level hint (e.g., “numeric mismatch”), steering the agent without leaking the exact values.
  4. Benchmark construction – The authors selected 23 recent SE papers, extracted 60 reproducible experiments (different languages, build systems, datasets), and packaged each as a reproducibility task with a hidden ground truth.
  5. Baseline comparison – A “vanilla” LLM agent (mini‑swe‑agent) that receives the same inputs but lacks the iterative judging loop serves as the primary baseline.
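The five steps above amount to a generate–execute–judge loop. The following sketch shows that control flow only; `generate_script`, `run_sandboxed`, and `judge` are hypothetical stand‑ins for the LLM call, the sandbox, and the hidden oracle, and the stubs exist merely to make the loop runnable.

```python
# Sketch of the iterative generate-execute-judge loop (hypothetical API).
# The three stubs stand in for external components: an LLM, a sandbox,
# and the hidden oracle described in the paper.

def generate_script(paper_text, repo, feedback):
    # Stub: a real agent would prompt an LLM, conditioning on prior feedback.
    return "echo fixed" if feedback else "echo draft"

def run_sandboxed(script, repo):
    # Stub: a real sandbox would execute the script and collect metrics/logs.
    return {"ok": script == "echo fixed"}, "log output"

def judge(output):
    # Stub oracle: returns pass/fail plus a coarse hint, never ground truth.
    return (True, "pass") if output.get("ok") else (False, "numeric mismatch")

def reproduce(paper_text, repo, max_rounds=5):
    """Iteratively refine a reproduction script until the judge accepts it."""
    feedback = None
    for _ in range(max_rounds):
        script = generate_script(paper_text, repo, feedback)  # 1. draft/revise
        output, logs = run_sandboxed(script, repo)            # 2. execute
        passed, hint = judge(output)                          # 3. judge
        if passed:
            # The script, not the LLM session, is the auditable artifact.
            return script
        feedback = (hint, logs)                               # 4. steer retry
    return None                                               # give up
```

Because the loop terminates with a concrete script, anyone can later re‑run or audit the reproduction without access to the agent.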

Results & Findings

| Metric | Artisan | Baseline (mini‑swe‑agent) |
| --- | --- | --- |
| Correct reproduction scripts | 44 / 60 (73%) | 14 / 60 (23%) |
| Scripts generated per hour | 1.25 | 0.40 |
| Average wall‑clock time per task | ≈ 48 min | ≈ 150 min |
| New errors uncovered | 20 | 3 |

  • Higher success rate: Artisan’s iterative feedback loop dramatically reduces the trial‑and‑error burden on the LLM.
  • Speed: Even with multiple execution cycles, the total time stays under an hour per task, making batch evaluation feasible.
  • Error detection: The system’s systematic checks surface hidden bugs (e.g., missing data files, mismatched hyper‑parameters) that human reviewers missed.

Practical Implications

  • Conference & journal pipelines – Journals could plug Artisan into their artifact‑evaluation workflow, automatically generating reproducibility scripts for every submission and flagging problematic artifacts before human review.
  • Continuous integration for research – Researchers can integrate Artisan into their CI pipelines to verify that their codebase still reproduces the paper after each change, catching regressions early.
  • Developer tooling – IDE extensions could invoke Artisan to auto‑generate “run‑my‑paper” scripts for open‑source research projects, lowering the barrier for practitioners to adopt new techniques.
  • Educational use – In software‑engineering courses, students can use Artisan to explore how published experiments are built, fostering a deeper understanding of reproducibility best practices.
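The continuous‑integration use case can be sketched concretely. This is a hypothetical gate, not part of Artisan: it assumes the repository ships an Artisan‑generated `reproduce.sh` that prints its metrics as a single JSON object, and the metric name and tolerance are illustrative.

```python
# Hypothetical CI regression gate: re-run an Artisan-generated reproduction
# script and fail the build if reported metrics drift from the paper's
# numbers. "reproduce.sh", the metric key, and the tolerance are assumptions.
import json
import subprocess

PAPER_NUMBERS = {"accuracy": 0.914}  # numbers reported in the paper
TOLERANCE = 0.005                    # allowed drift before the build fails

def within_tolerance(reported: dict) -> bool:
    """True if every paper metric is reproduced within TOLERANCE."""
    return all(abs(reported.get(k, float("inf")) - v) <= TOLERANCE
               for k, v in PAPER_NUMBERS.items())

def ci_gate() -> int:
    """Run the reproduction script; return a CI exit code (0 = pass)."""
    result = subprocess.run(["bash", "reproduce.sh"],
                            capture_output=True, text=True, check=True)
    return 0 if within_tolerance(json.loads(result.stdout)) else 1
```

Wired into a CI job, `ci_gate()` turns "the paper still reproduces" into an ordinary failing check, catching regressions the moment a commit breaks an experiment.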

Limitations & Future Work

  • Scope of artifacts – Artisan currently handles command‑line scripts and typical build systems; more complex environments (e.g., distributed clusters, GPU‑heavy deep‑learning pipelines) remain out of reach.
  • Reliance on LLM quality – The approach inherits the hallucination risks of LLMs; occasional nonsensical commands still require manual oversight.
  • Hidden‑oracle assumption – The judging mechanism presumes access to the exact expected outputs, which may not be available for all papers (e.g., stochastic results).
  • Future directions suggested by the authors include extending the benchmark to other SE sub‑fields (e.g., program analysis tools), incorporating richer environment specifications (Docker/Kubernetes), and exploring hybrid human‑in‑the‑loop workflows to further improve reliability.

Authors

  • Doehyun Baek
  • Michael Pradel

Paper Information

  • arXiv ID: 2602.10046v1
  • Categories: cs.SE
  • Published: February 10, 2026