[Paper] OpenDORS: A dataset of openly referenced open research software
Source: arXiv - 2512.01570v1
Overview
The paper introduces OpenDORS, a massive, openly curated dataset that links more than 134 k research software projects to the scholarly articles that cite them. By aggregating repository metadata (licenses, languages, version info) at scale, the authors give the community a concrete foundation for quantitative studies of research software engineering (RSE) practices.
Key Contributions
- A large‑scale, open dataset of 134 352 unique research‑software projects together with 134 154 source‑code repositories referenced in open‑access papers.
- Rich per‑repository metadata (latest release, license, primary programming language, and presence of descriptive files such as README, CITATION.cff, and CODE_OF_CONDUCT).
- Linkage between publications and software, enabling traceability from a research claim to the exact code version used.
- Statistical overview of the dataset (e.g., language distribution, license popularity) that serves as a baseline for future RSE analyses.
- Open‑source release under a permissive license, encouraging reuse, extension, and community contributions.
Methodology
- Literature Harvesting – The authors mined open‑access articles from major repositories (e.g., arXiv, PubMed Central) for URLs that point to code hosting platforms (GitHub, GitLab, Bitbucket, etc.).
- Deduplication & Normalization – Identical repository URLs appearing in multiple papers were collapsed, yielding a set of unique software projects.
- Metadata Extraction – For each repository, the latest commit was inspected via the hosting platform’s API to collect:
- Current version tag or release name
- SPDX‑compatible license identifier
- Primary programming language (as reported by the platform)
- Presence of common metadata files (README, CITATION.cff, LICENSE, CONTRIBUTING.md).
- Dataset Assembly – Each record stores the citing paper’s DOI, the repository URL, and the extracted metadata. The full collection is released as CSV/JSON files together with a small Python library for easy querying.
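To make the publication‑to‑software linkage concrete, a single record might look roughly like the dictionary below; the field names are illustrative placeholders inferred from the description above, not the dataset's documented schema.

```python
# Illustrative OpenDORS-style record (hypothetical field names, not the real schema).
example_record = {
    "paper_doi": "10.1234/example.5678",          # DOI of the citing paper
    "repository_url": "https://github.com/example-org/example-tool",
    "latest_release": "v2.1.0",                   # latest tag/release, if any
    "license_spdx": "MIT",                        # SPDX-compatible identifier
    "primary_language": "Python",                 # as reported by the platform
    "has_readme": True,
    "has_citation_cff": False,
    "has_license_file": True,
    "has_contributing": False,
}
```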
The pipeline is fully automated, making it straightforward to refresh the dataset as new papers appear.
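As an illustration of the harvesting and metadata‑extraction steps described above, here is a minimal Python sketch. The GitHub REST API endpoints it calls (/repos/{owner}/{repo}, .../releases/latest, .../contents/CITATION.cff) are real, but the regular expression, function names, and sample text are assumptions; this is not the authors' actual pipeline code.

```python
import re
import requests

# Match links to the major code-hosting platforms mentioned in the paper.
HOST_PATTERN = re.compile(
    r"https?://(?:www\.)?(github\.com|gitlab\.com|bitbucket\.org)/"
    r"([\w.\-]+)/([\w.\-]+)",
    re.IGNORECASE,
)

def harvest_repo_urls(article_text: str) -> set[str]:
    """Extract and normalize repository URLs found in an article's full text."""
    urls = set()
    for host, owner, repo in HOST_PATTERN.findall(article_text):
        repo = repo.removesuffix(".git")  # normalize away trailing .git
        urls.add(f"https://{host.lower()}/{owner}/{repo}")
    return urls

def github_metadata(owner: str, repo: str) -> dict:
    """Collect high-level metadata for one GitHub repository via the REST API."""
    base = f"https://api.github.com/repos/{owner}/{repo}"
    info = requests.get(base, timeout=10).json()
    meta = {
        "license_spdx": (info.get("license") or {}).get("spdx_id"),
        "primary_language": info.get("language"),
        "latest_release": None,
        "has_citation_cff": False,
    }
    release = requests.get(f"{base}/releases/latest", timeout=10)
    if release.status_code == 200:
        meta["latest_release"] = release.json().get("tag_name")
    citation = requests.get(f"{base}/contents/CITATION.cff", timeout=10)
    meta["has_citation_cff"] = citation.status_code == 200
    return meta

if __name__ == "__main__":
    text = "Our tool is available at https://github.com/octocat/Hello-World."
    for url in harvest_repo_urls(text):
        owner, repo = url.rsplit("/", 2)[-2:]  # this sketch handles GitHub URLs only
        print(url, github_metadata(owner, repo))
```

A full‑scale harvest would additionally need authenticated API tokens, rate‑limit handling, and equivalent calls for GitLab and Bitbucket.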
Results & Findings
- Coverage – 134 352 distinct software projects are linked to 134 154 source‑code repositories, an almost one‑to‑one correspondence between projects and repositories; a given paper may reference one or several of them.
- Licensing – Over 60 % of the repositories use permissive licenses (MIT, BSD, Apache 2.0), while ~15 % are under GPL‑family licenses; the remainder lack a clear license declaration.
- Language Landscape – Python dominates (≈ 45 % of projects), followed by R, Java, and C/C++. This mirrors the prevalence of data‑science and statistical computing in research.
- Metadata Adoption – Only ~30 % of repositories contain a CITATION.cff file, indicating that formal citation guidance for software is still rare.
- Versioning – Roughly half of the projects have an explicit release tag; the rest rely on the default master/main branch, which can hinder reproducibility.
These descriptive statistics already reveal gaps (e.g., missing licenses, sparse citation files) that RSE scholars can target in future work.
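To show how such descriptive statistics could be recomputed from the released CSV, here is a short pandas sketch; the file name and column names (license_spdx, primary_language, has_citation_cff) are assumed for illustration and should be checked against the actual schema.

```python
import pandas as pd

# Hypothetical file and column names; the released dataset defines the real schema.
df = pd.read_csv("opendors.csv")

# Share of repositories per SPDX license identifier (NaN = no license detected).
license_share = df["license_spdx"].value_counts(normalize=True, dropna=False)

# Share of repositories per primary programming language.
language_share = df["primary_language"].value_counts(normalize=True)

# Fraction of repositories shipping a CITATION.cff file.
citation_rate = df["has_citation_cff"].mean()

print(license_share.head(10))
print(language_share.head(10))
print(f"CITATION.cff present in {citation_rate:.1%} of repositories")
```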
Practical Implications
- Reproducibility Audits – Developers can cross‑check their own projects against the dataset to see whether they meet community norms (license, citation file, versioned releases); a minimal sketch of such a check follows this list.
- Tooling for RSE – The metadata can feed dashboards that alert researchers when a referenced repository lacks a license or proper citation metadata, prompting quick remediation.
- Policy & Funding – Funding agencies can use the dataset to benchmark compliance with open‑science mandates (e.g., mandatory licensing, citation of software).
- Search & Discovery – Platforms like Zenodo or Figshare could integrate OpenDORS to surface the most‑cited research software, helping developers find proven codebases to build upon.
- Machine‑Learning Analyses – The structured data enables large‑scale modeling of software evolution, language adoption trends, or the impact of licensing choices on citation counts.
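As a sketch of the audit and dashboard ideas above, the snippet below joins a list of referenced repository URLs against the dataset and flags entries without a license or a CITATION.cff file; the file name, column names, and function are hypothetical, not a tool shipped with OpenDORS.

```python
import pandas as pd

# Hypothetical file and column names; the released dataset defines the real schema.
dataset = pd.read_csv("opendors.csv")

def flag_noncompliant(repo_urls: list[str]) -> pd.DataFrame:
    """Return referenced repositories that lack a license or a CITATION.cff file."""
    referenced = dataset[dataset["repository_url"].isin(repo_urls)]
    problems = referenced[
        referenced["license_spdx"].isna()      # no SPDX license recorded
        | ~referenced["has_citation_cff"]      # assumed boolean column
    ]
    return problems[["repository_url", "license_spdx", "has_citation_cff"]]

if __name__ == "__main__":
    report = flag_noncompliant(["https://github.com/example-org/example-tool"])
    print(report if not report.empty else "All referenced repositories look compliant.")
```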
In short, OpenDORS turns a scattered set of “software mentions” into a structured, queryable resource that developers, repository maintainers, and research managers can act upon.
Limitations & Future Work
- Open‑Access Bias – The dataset only includes papers that are freely available; software cited in pay‑walled articles is omitted, potentially skewing discipline coverage.
- Repository Scope – Only publicly reachable URLs on major hosting services were captured; self‑hosted or institutional repositories may be missing.
- Static Snapshot – While the pipeline can be rerun, the released version is a snapshot; continuous integration would be needed for truly up‑to‑date analyses.
- Metadata Depth – The current extraction stops at high‑level fields; deeper code‑quality metrics (test coverage, CI status) are left for future extensions.
The authors plan to broaden source coverage, add dynamic quality indicators, and provide a live API so the community can keep the dataset fresh and increasingly actionable.
Authors
- Stephan Druskat
- Lars Grunske
Paper Information
- arXiv ID: 2512.01570v1
- Categories: cs.SE, cs.DL
- Published: December 1, 2025
- PDF: https://arxiv.org/pdf/2512.01570v1