[Paper] Artisan: Agentic Artifact Evaluation
Source: arXiv - 2602.10046v1
Overview
Artifact evaluation—checking that a research paper’s code and data actually reproduce the reported results—has become a cornerstone of software‑engineering research. However, the manual effort required limits it to a handful of papers. The new Artisan system shows how a large‑language‑model (LLM) agent can automatically generate reproducible scripts, turning a traditionally labor‑intensive task into a scalable, repeatable service.
Key Contributions
- Reframing reproduction as code generation – Artisan treats the whole reproducibility problem as “write a script that, when run, yields the paper’s numbers,” enabling the generated script to be inspected, executed, and audited independently of the LLM.
- Automated judging mechanism – A hidden “oracle” evaluates the script’s output against the expected results without exposing the ground truth, preventing shortcuts such as copying pre‑computed tables.
- Artisan‑Bench benchmark – The first curated suite (60 tasks from 23 SE papers, spanning multiple languages and sub‑domains) for measuring automated artifact‑evaluation capabilities. All tasks were manually verified to be reproducible.
- Empirical validation – Artisan produced correct reproduction scripts for 44/60 tasks, a 3.14× improvement over a strong baseline LLM agent, while averaging only 0.45 minutes of compute time and roughly 48 minutes of wall‑clock time per task.
- Error discovery – The system uncovered 20 previously unknown bugs or inconsistencies in the original papers or their artifacts.
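To make the “reproduction as code generation” framing concrete, a generated script might look like the minimal sketch below. The metric, the inlined measurements, and the reported number are hypothetical stand‑ins, not taken from the paper or any real artifact; a real driver would instead invoke the artifact’s build and experiment commands.

```python
#!/usr/bin/env python3
# Hypothetical reproduction driver: recompute one reported number from the
# raw measurements an artifact would ship (inlined here for self-containment).
# Everything below is illustrative, not from the Artisan paper.

raw_accuracies = [0.71, 0.74, 0.73, 0.74]  # stand-in for the artifact's result log


def reproduce_table_row():
    """Recompute the mean accuracy the paper would report in its results table."""
    mean_accuracy = sum(raw_accuracies) / len(raw_accuracies)
    return f"accuracy = {mean_accuracy:.2f}"


if __name__ == "__main__":
    print(reproduce_table_row())
```

Because the script is a plain, self‑contained program, a reviewer can read, run, and audit it without involving the LLM that wrote it.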
Methodology
- Problem formulation – The authors model artifact evaluation as a script‑generation problem: given a PDF (or its parsed text) and the associated artifact repository, the LLM must output a self‑contained script (e.g., a Bash or Python driver) that reproduces the target figures/tables.
- LLM agent design – Artisan builds on a state‑of‑the‑art LLM (e.g., GPT‑4‑style) augmented with a tool‑use loop: the agent can invoke a sandboxed execution environment, inspect logs, and iteratively refine its script.
- Judging feedback – After each execution, an automated judge compares the script’s output to the expected numbers (stored in a hidden file). The judge returns only a pass/fail signal plus a high‑level hint (e.g., “numeric mismatch”), steering the agent without leaking the exact values.
- Benchmark construction – The authors selected 23 recent SE papers, extracted 60 reproducible experiments (different languages, build systems, datasets), and packaged each as a reproducibility task with a hidden ground truth.
- Baseline comparison – A “vanilla” LLM agent (mini‑swe‑agent) that receives the same inputs but lacks the iterative judging loop serves as the primary baseline.
Results & Findings
| Metric | Artisan | Baseline (mini‑swe‑agent) |
|---|---|---|
| Correct reproduction scripts | 44 / 60 (73%) | 14 / 60 (23%) |
| Scripts generated per hour | 1.25 | 0.40 |
| Average wall‑clock time per task | ≈ 48 min | ≈ 150 min |
| New errors uncovered | 20 | 3 |
- Higher success rate: Artisan’s iterative judging feedback turns blind trial and error into guided refinement, letting the agent converge on a correct script far more often than the baseline.
- Speed: Even with multiple execution cycles, the total time stays under an hour per task, making batch evaluation feasible.
- Error detection: The system’s systematic checks surface hidden bugs (e.g., missing data files, mismatched hyper‑parameters) that human reviewers missed.
Practical Implications
- Conference & journal pipelines – Conferences and journals could plug Artisan into their artifact‑evaluation workflows, automatically generating reproducibility scripts for every submission and flagging problematic artifacts before human review.
- Continuous integration for research – Researchers can integrate Artisan into their CI pipelines to verify that their codebase still reproduces the paper after each change, catching regressions early.
- Developer tooling – IDE extensions could invoke Artisan to auto‑generate “run‑my‑paper” scripts for open‑source research projects, lowering the barrier for practitioners to adopt new techniques.
- Educational use – In software‑engineering courses, students can use Artisan to explore how published experiments are built, fostering a deeper understanding of reproducibility best practices.
Limitations & Future Work
- Scope of artifacts – Artisan currently handles command‑line scripts and typical build systems; more complex environments (e.g., distributed clusters, GPU‑heavy deep‑learning pipelines) remain out of reach.
- Reliance on LLM quality – The approach inherits the hallucination risks of LLMs; occasional nonsensical commands still require manual oversight.
- Hidden‑oracle assumption – The judging mechanism presumes access to the exact expected outputs, which may not be available for all papers (e.g., stochastic results).
- Future directions suggested by the authors include extending the benchmark to other SE sub‑fields (e.g., program analysis tools), incorporating richer environment specifications (Docker/Kubernetes), and exploring hybrid human‑in‑the‑loop workflows to further improve reliability.
Authors
- Doehyun Baek
- Michael Pradel
Paper Information
- arXiv ID: 2602.10046v1
- Categories: cs.SE
- Published: February 10, 2026