[Paper] Artisan: Agentic Artifact Evaluation
Source: arXiv - 2602.10046v1
Overview
Artifact evaluation—checking that a research paper’s code and data actually reproduce the reported results—has become a cornerstone of software‑engineering research. However, the manual effort required limits it to a handful of papers. The new Artisan system shows how a large‑language‑model (LLM) agent can automatically generate reproducible scripts, turning a traditionally labor‑intensive task into a scalable, repeatable service.
Key Contributions
- Reframing reproduction as code generation – Artisan treats the whole reproducibility problem as “write a script that, when run, yields the paper’s numbers,” enabling the generated script to be inspected, executed, and audited independently of the LLM.
- Automated judging mechanism – A hidden “oracle” evaluates the script’s output against the expected results without exposing the ground truth, preventing shortcuts such as copying pre‑computed tables.
- Artisan‑Bench benchmark – The first curated suite (60 tasks from 23 SE papers, spanning multiple languages and sub‑domains) for measuring automated artifact‑evaluation capabilities. All tasks were manually verified to be reproducible.
- Empirical validation – Artisan produced correct reproduction scripts for 44/60 tasks, a 3.14× improvement over a strong baseline LLM agent, while averaging only 0.45 minutes of compute time and roughly 48 minutes of wall‑clock time per task.
- Error discovery – The system uncovered 20 previously unknown bugs or inconsistencies in the original papers or their artifacts.
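To make the “reproduction as code generation” framing concrete, a generated script might look like the minimal sketch below. The metric, the inlined measurements, and the reported number are hypothetical stand‑ins, not taken from the paper or any real artifact; a real driver would instead invoke the artifact’s build and experiment commands.

```python
#!/usr/bin/env python3
# Hypothetical reproduction driver: recompute one reported number from the
# raw measurements an artifact would ship (inlined here for self-containment).
# Everything below is illustrative, not from the Artisan paper.

raw_accuracies = [0.71, 0.74, 0.73, 0.74]  # stand-in for the artifact's result log


def reproduce_table_row():
    """Recompute the mean accuracy the paper would report in its results table."""
    mean_accuracy = sum(raw_accuracies) / len(raw_accuracies)
    return f"accuracy = {mean_accuracy:.2f}"


if __name__ == "__main__":
    print(reproduce_table_row())
```

Because the script is a plain, self‑contained program, a reviewer can read, run, and audit it without involving the LLM that wrote it.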
Methodology
- Problem formulation – The authors model artifact evaluation as a script‑generation problem: given a PDF (or its parsed text) and the associated artifact repository, the LLM must output a self‑contained script (e.g., a Bash or Python driver) that reproduces the target figures/tables.
- LLM agent design – Artisan builds on a state‑of‑the‑art LLM (e.g., GPT‑4‑style) augmented with a tool‑use loop: the agent can invoke a sandboxed execution environment, inspect logs, and iteratively refine its script.
- Judging feedback – After each execution, an automated judge compares the script’s output to the expected numbers (stored in a hidden file). The judge returns only a pass/fail signal plus a high‑level hint (e.g., “numeric mismatch”), steering the agent without leaking the exact values.
- Benchmark construction – The authors selected 23 recent SE papers, extracted 60 reproducible experiments (different languages, build systems, datasets), and packaged each as a reproducibility task with a hidden ground truth.
- Baseline comparison – A “vanilla” LLM agent (mini‑swe‑agent) that receives the same inputs but lacks the iterative judging loop serves as the primary baseline.
Results & Findings
| Metric | Artisan | Baseline (mini‑swe‑agent) |
|---|---|---|
| Correct reproduction scripts | 44 / 60 (73%) | 14 / 60 (23%) |
| Scripts generated per hour | 1.25 | 0.40 |
| Average wall‑clock time per task | ≈ 48 min | ≈ 150 min |
| New errors uncovered | 20 | 3 |
- Higher success rate: Artisan’s iterative judging feedback turns blind trial and error into guided refinement, letting the agent converge on a correct script far more often than the baseline.
- Speed: Even with multiple execution cycles, the total time stays under an hour per task, making batch evaluation feasible.
- Error detection: The system’s systematic checks surface hidden bugs (e.g., missing data files, mismatched hyper‑parameters) that human reviewers missed.
Practical Implications
- Conference & journal pipelines – Conferences and journals could plug Artisan into their artifact‑evaluation workflows, automatically generating reproducibility scripts for every submission and flagging problematic artifacts before human review.
- Continuous integration for research – Researchers can integrate Artisan into their CI pipelines to verify that their codebase still reproduces the paper after each change, catching regressions early.
- Developer tooling – IDE extensions could invoke Artisan to auto‑generate “run‑my‑paper” scripts for open‑source research projects, lowering the barrier for practitioners to adopt new techniques.
- Educational use – In software‑engineering courses, students can use Artisan to explore how published experiments are built, fostering a deeper understanding of reproducibility best practices.
Limitations & Future Work
- Scope of artifacts – Artisan currently handles command‑line scripts and typical build systems; more complex environments (e.g., distributed clusters, GPU‑heavy deep‑learning pipelines) remain out of reach.
- Reliance on LLM quality – The approach inherits the hallucination risks of LLMs; occasional nonsensical commands still require manual oversight.
- Hidden‑oracle assumption – The judging mechanism presumes access to the exact expected outputs, which may not be available for all papers (e.g., stochastic results).
- Future directions suggested by the authors include extending the benchmark to other SE sub‑fields (e.g., program analysis tools), incorporating richer environment specifications (Docker/Kubernetes), and exploring hybrid human‑in‑the‑loop workflows to further improve reliability.
Authors
- Doehyun Baek
- Michael Pradel
Paper Information
- arXiv ID: 2602.10046v1
- Categories: cs.SE
- Published: February 10, 2026