[Paper] The State of Open Science in Software Engineering Research: A Case Study of ICSE Artifacts

Published: January 5, 2026 at 07:47 AM EST
3 min read
Source: arXiv - 2601.02066v1

Overview

The paper investigates how usable the replication packages (artifacts) that accompany software‑engineering papers are in practice. By attempting to run and reproduce the results of 100 artifacts published with ICSE papers from 2015‑2024, the authors expose a sizable gap between the promise of “open science” and what developers can actually get working on their own machines.

Key Contributions

  • Large‑scale empirical assessment of 100 ICSE replication packages spanning a decade.
  • Quantitative metrics on executability (40 % runnable), required effort, and reproducibility (35 % of runnable artifacts reproduced original results).
  • Taxonomy of obstacles: five modification types and 13 distinct challenges (environment, documentation, structural issues).
  • Actionable guidelines for authors, reviewers, and conference organizers to raise the quality of artifact sharing.

Methodology

  1. Artifact selection – The authors collected every replication package that was officially linked to an ICSE paper between 2015 and 2024 (total = 100).
  2. Execution attempts – Over ~650 person‑hours, a team of researchers cloned each repository, set up the declared environment, and tried to run the provided scripts.
  3. Effort classification – Each successful execution was labeled as low, moderate, or high effort based on the amount of manual tweaking required (e.g., installing missing libraries, fixing path issues).
  4. Reproduction check – For the runnable artifacts, the team reran the experiments and compared the output with the numbers reported in the original paper (a sketch of such a comparison appears at the end of this section).
  5. Problem analysis – When execution failed, the researchers logged the root cause, later clustering them into a set of common modification types and challenges.

The process was deliberately transparent: all logs, scripts, and classification criteria are released alongside the paper, enabling others to replicate the study itself.
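The reproduction check in step 4 amounts to comparing re‑run output against the numbers reported in the paper. The study does not prescribe tooling for this comparison, so the following is a minimal sketch, assuming the reported and re‑run metrics are available as name‑to‑value dictionaries and that a small relative tolerance is acceptable; the metric names and values are purely illustrative.

```python
import math

# Hypothetical metrics: "reported" transcribed from the paper,
# "rerun" produced by executing the artifact locally.
reported = {"precision": 0.83, "recall": 0.79, "f1": 0.81}
rerun    = {"precision": 0.83, "recall": 0.78, "f1": 0.80}

def mismatched_metrics(reported, rerun, rel_tol=0.01):
    """Return the metrics whose re-run value deviates from the reported
    value by more than rel_tol (1% by default), or is missing entirely."""
    mismatches = {}
    for name, expected in reported.items():
        actual = rerun.get(name)
        if actual is None or not math.isclose(expected, actual, rel_tol=rel_tol):
            mismatches[name] = (expected, actual)
    return mismatches

if __name__ == "__main__":
    diffs = mismatched_metrics(reported, rerun)
    if diffs:
        print("Not reproduced:", diffs)
    else:
        print("All reported metrics reproduced within tolerance.")
```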

Results & Findings

  • Executable artifacts: 40 % (40/100)
  • Ran without any change: 32.5 % of executable (13/40)
  • Low‑effort executions: 17.5 % of executable (7/40)
  • Moderate‑to‑high effort required: 82.5 % of executable (33/40)
  • Reproduced the original results: 35 % of executable (14/40)

What this means

  • Availability ≠ usability – Even when authors provide a package, most developers need to invest non‑trivial time to get it running.
  • Reproducibility is lower than executability – Only a third of the runnable artifacts actually yielded the same numbers as the paper, indicating hidden dependencies or undocumented steps.
  • Common pain points – Issues fell into five modification categories (e.g., missing dependencies, hard‑coded paths, outdated libraries) and 13 challenge types, with environment configuration and insufficient documentation being the most frequent.

Practical Implications

  • For tool developers – When building CI pipelines or reproducibility platforms (e.g., ReproZip, Docker‑based runners), the study highlights the need for automated environment capture and dependency resolution.
  • For conference organizers – ICSE and similar venues could mandate containerized artifacts (Docker, OCI) or reproducibility badges that require a minimal “one‑click” execution test.
  • For researchers and engineers – The guidelines suggest adopting reproducible‑by‑design practices: use package managers, pin versions, provide clear README steps, and include automated sanity‑check scripts (see the sketch after this list).
  • For industry practitioners – When evaluating academic prototypes for adoption, teams should budget extra time for artifact validation, or request containerized demos to reduce integration friction.
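As a concrete illustration of the "automated sanity‑check scripts" recommendation above, here is a minimal sketch for a Python‑based artifact. The file names (README, requirements.txt, run.py) and the --smoke-test flag are assumptions made for the example, not conventions mandated by the paper.

```python
#!/usr/bin/env python3
"""Minimal artifact sanity check: documentation present, dependencies
pinned, and a smoke test that at least starts and exits cleanly."""
import pathlib
import subprocess
import sys

root = pathlib.Path(".")
problems = []

# 1. Documentation: some form of README should exist.
if not any((root / name).exists() for name in ("README.md", "README.txt", "README")):
    problems.append("no README found")

# 2. Pinned dependencies: every requirement should carry an exact version.
req = root / "requirements.txt"
if not req.exists():
    problems.append("no requirements.txt found")
else:
    unpinned = [line.strip() for line in req.read_text().splitlines()
                if line.strip() and not line.startswith("#") and "==" not in line]
    if unpinned:
        problems.append(f"unpinned dependencies: {unpinned}")

# 3. Smoke test: the declared entry point should run to completion quickly.
entry = root / "run.py"
if entry.exists():
    try:
        result = subprocess.run([sys.executable, str(entry), "--smoke-test"],
                                capture_output=True, timeout=300)
        if result.returncode != 0:
            problems.append("smoke test failed")
    except subprocess.TimeoutExpired:
        problems.append("smoke test timed out")
else:
    problems.append("no run.py entry point found")

sys.exit(f"Sanity check failed: {problems}" if problems else 0)
```

A script like this can double as the "one‑click" execution test mentioned above when wired into a CI job or an artifact‑evaluation checklist.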

Limitations & Future Work

  • Scope limited to ICSE – Findings may not generalize to other SE venues or interdisciplinary conferences.
  • Binary executability metric – The study treats an artifact as “executable” if it runs at all, without grading partial success (e.g., runs but crashes later).
  • Human effort measurement – Effort levels were judged by the research team; automated effort metrics (e.g., number of manual commands) could provide more objective data.

Future research directions include extending the analysis to other conferences, exploring the impact of containerization standards, and building tool support that automatically flags the identified 13 challenge types during artifact submission.
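One way to act on that last direction is a submission‑time linter that flags common problems before reviewers ever open the artifact. The paper's full taxonomy of 13 challenge types is not enumerated in this summary, so the sketch below covers only three representative checks (missing dependency manifest or container spec, hard‑coded absolute paths, thin documentation); all heuristics and file names are illustrative.

```python
import pathlib
import re
import sys

# Heuristic for hard-coded absolute paths (Unix home directories or
# Windows drive letters); intentionally simple for illustration.
ABS_PATH = re.compile(r"(/home/|/Users/|[A-Za-z]:\\)")

def flag_challenges(artifact_dir: str) -> list[str]:
    root = pathlib.Path(artifact_dir)
    flags = []

    # Environment: no dependency manifest or container spec at the top level.
    manifests = ("requirements.txt", "environment.yml", "pyproject.toml",
                 "Dockerfile", "setup.py")
    if not any((root / m).exists() for m in manifests):
        flags.append("no dependency manifest or container spec found")

    # Structure: hard-coded absolute paths inside Python scripts.
    for script in root.rglob("*.py"):
        if ABS_PATH.search(script.read_text(errors="ignore")):
            flags.append(f"hard-coded absolute path in {script}")

    # Documentation: README missing or too short to describe a run procedure.
    readmes = sorted(root.glob("README*"))
    if not readmes or len(readmes[0].read_text(errors="ignore")) < 200:
        flags.append("README missing or too brief to explain how to run")

    return flags

if __name__ == "__main__":
    for flag in flag_challenges(sys.argv[1] if len(sys.argv) > 1 else "."):
        print("WARNING:", flag)
```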

Authors

  • Al Muttakin
  • Saikat Mondal
  • Chanchal Roy

Paper Information

  • arXiv ID: 2601.02066v1
  • Categories: cs.SE
  • Published: January 5, 2026