[Paper] The State of Open Science in Software Engineering Research: A Case Study of ICSE Artifacts

Published: January 5, 2026 at 07:47 AM EST
3 min read
Source: arXiv - 2601.02066v1

Overview

The paper investigates how usable the replication packages (artifacts) that accompany software‑engineering papers are in practice. By attempting to run and reproduce the results of 100 artifacts published with ICSE papers from 2015‑2024, the authors expose a sizable gap between the promise of “open science” and what developers can actually get working on their own machines.

Key Contributions

  • Large‑scale empirical assessment of 100 ICSE replication packages spanning a decade.
  • Quantitative metrics on executability (40 % runnable), required effort, and reproducibility (35 % of runnable artifacts reproduced original results).
  • Taxonomy of obstacles: five modification types and 13 distinct challenges (environment, documentation, structural issues).
  • Actionable guidelines for authors, reviewers, and conference organizers to raise the quality of artifact sharing.

Methodology

  1. Artifact selection – The authors collected every replication package that was officially linked to an ICSE paper between 2015 and 2024 (total = 100).
  2. Execution attempts – Over ~650 person‑hours, a team of researchers cloned each repository, set up the declared environment, and tried to run the provided scripts.
  3. Effort classification – Each successful execution was labeled as low, moderate, or high effort based on the amount of manual tweaking required (e.g., installing missing libraries, fixing path issues).
  4. Reproduction check – For the runnable artifacts, the team reran the experiments and compared the output with the numbers reported in the original paper (a sketch of such a comparison appears at the end of this section).
  5. Problem analysis – When execution failed, the researchers logged the root cause, later clustering them into a set of common modification types and challenges.

The process was deliberately transparent: all logs, scripts, and classification criteria are released alongside the paper, enabling others to replicate the study itself.
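The reproduction check in step 4 amounts to comparing re‑run output against the numbers reported in the paper. The study does not prescribe tooling for this comparison, so the following is a minimal sketch, assuming the reported and re‑run metrics are available as name‑to‑value dictionaries and that a small relative tolerance is acceptable; the metric names and values are purely illustrative.

```python
import math

# Hypothetical metrics: "reported" transcribed from the paper,
# "rerun" produced by executing the artifact locally.
reported = {"precision": 0.83, "recall": 0.79, "f1": 0.81}
rerun    = {"precision": 0.83, "recall": 0.78, "f1": 0.80}

def mismatched_metrics(reported, rerun, rel_tol=0.01):
    """Return the metrics whose re-run value deviates from the reported
    value by more than rel_tol (1% by default), or is missing entirely."""
    mismatches = {}
    for name, expected in reported.items():
        actual = rerun.get(name)
        if actual is None or not math.isclose(expected, actual, rel_tol=rel_tol):
            mismatches[name] = (expected, actual)
    return mismatches

if __name__ == "__main__":
    diffs = mismatched_metrics(reported, rerun)
    if diffs:
        print("Not reproduced:", diffs)
    else:
        print("All reported metrics reproduced within tolerance.")
```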

Results & Findings

  • Executable artifacts: 40 % (40/100)
  • Ran without any change: 32.5 % of executable (13/40)
  • Low‑effort executions: 17.5 % of executable (7/40)
  • Moderate‑to‑high effort required: 82.5 % of executable (33/40)
  • Reproduced the original results: 35 % of executable (14/40)

What this means

  • Availability ≠ usability – Even when authors provide a package, most developers need to invest non‑trivial time to get it running.
  • Reproducibility is lower than executability – Only a third of the runnable artifacts actually yielded the same numbers as the paper, indicating hidden dependencies or undocumented steps.
  • Common pain points – Issues fell into five modification categories (e.g., missing dependencies, hard‑coded paths, outdated libraries) and 13 challenge types, with environment configuration and insufficient documentation being the most frequent.

Practical Implications

  • For tool developers – When building CI pipelines or reproducibility platforms (e.g., ReproZip, Docker‑based runners), the study highlights the need for automated environment capture and dependency resolution.
  • For conference organizers – ICSE and similar venues could mandate containerized artifacts (Docker, OCI) or reproducibility badges that require a minimal “one‑click” execution test.
  • For researchers and engineers – The guidelines suggest adopting reproducible‑by‑design practices: use package managers, pin versions, provide clear README steps, and include automated sanity‑check scripts (see the sketch after this list).
  • For industry practitioners – When evaluating academic prototypes for adoption, teams should budget extra time for artifact validation, or request containerized demos to reduce integration friction.
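As a concrete illustration of the "automated sanity‑check scripts" recommendation above, here is a minimal sketch for a Python‑based artifact. The file names (README, requirements.txt, run.py) and the --smoke-test flag are assumptions made for the example, not conventions mandated by the paper.

```python
#!/usr/bin/env python3
"""Minimal artifact sanity check: documentation present, dependencies
pinned, and a smoke test that at least starts and exits cleanly."""
import pathlib
import subprocess
import sys

root = pathlib.Path(".")
problems = []

# 1. Documentation: some form of README should exist.
if not any((root / name).exists() for name in ("README.md", "README.txt", "README")):
    problems.append("no README found")

# 2. Pinned dependencies: every requirement should carry an exact version.
req = root / "requirements.txt"
if not req.exists():
    problems.append("no requirements.txt found")
else:
    unpinned = [line.strip() for line in req.read_text().splitlines()
                if line.strip() and not line.startswith("#") and "==" not in line]
    if unpinned:
        problems.append(f"unpinned dependencies: {unpinned}")

# 3. Smoke test: the declared entry point should run to completion quickly.
entry = root / "run.py"
if entry.exists():
    try:
        result = subprocess.run([sys.executable, str(entry), "--smoke-test"],
                                capture_output=True, timeout=300)
        if result.returncode != 0:
            problems.append("smoke test failed")
    except subprocess.TimeoutExpired:
        problems.append("smoke test timed out")
else:
    problems.append("no run.py entry point found")

sys.exit(f"Sanity check failed: {problems}" if problems else 0)
```

A script like this can double as the "one‑click" execution test mentioned above when wired into a CI job or an artifact‑evaluation checklist.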

Limitations & Future Work

  • Scope limited to ICSE – Findings may not generalize to other SE venues or interdisciplinary conferences.
  • Binary executability metric – The study treats an artifact as “executable” if it runs at all, without grading partial success (e.g., runs but crashes later).
  • Human effort measurement – Effort levels were judged by the research team; automated effort metrics (e.g., number of manual commands) could provide more objective data.

Future research directions include extending the analysis to other conferences, exploring the impact of containerization standards, and building tool support that automatically flags the identified 13 challenge types during artifact submission.
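One way to act on that last direction is a submission‑time linter that flags common problems before reviewers ever open the artifact. The paper's full taxonomy of 13 challenge types is not enumerated in this summary, so the sketch below covers only three representative checks (missing dependency manifest or container spec, hard‑coded absolute paths, thin documentation); all heuristics and file names are illustrative.

```python
import pathlib
import re
import sys

# Heuristic for hard-coded absolute paths (Unix home directories or
# Windows drive letters); intentionally simple for illustration.
ABS_PATH = re.compile(r"(/home/|/Users/|[A-Za-z]:\\)")

def flag_challenges(artifact_dir: str) -> list[str]:
    root = pathlib.Path(artifact_dir)
    flags = []

    # Environment: no dependency manifest or container spec at the top level.
    manifests = ("requirements.txt", "environment.yml", "pyproject.toml",
                 "Dockerfile", "setup.py")
    if not any((root / m).exists() for m in manifests):
        flags.append("no dependency manifest or container spec found")

    # Structure: hard-coded absolute paths inside Python scripts.
    for script in root.rglob("*.py"):
        if ABS_PATH.search(script.read_text(errors="ignore")):
            flags.append(f"hard-coded absolute path in {script}")

    # Documentation: README missing or too short to describe a run procedure.
    readmes = sorted(root.glob("README*"))
    if not readmes or len(readmes[0].read_text(errors="ignore")) < 200:
        flags.append("README missing or too brief to explain how to run")

    return flags

if __name__ == "__main__":
    for flag in flag_challenges(sys.argv[1] if len(sys.argv) > 1 else "."):
        print("WARNING:", flag)
```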

Authors

  • Al Muttakin
  • Saikat Mondal
  • Chanchal Roy

Paper Information

  • arXiv ID: 2601.02066v1
  • Categories: cs.SE
  • Published: January 5, 2026