[Paper] OpenDORS: A dataset of openly referenced open research software
Source: arXiv - 2512.01570v1
Overview
The paper introduces OpenDORS, a massive, openly curated dataset that links more than 134 k research software projects to the scholarly articles that cite them. By aggregating repository metadata (licenses, languages, version info) at scale, the authors give the community a concrete foundation for quantitative studies of research software engineering (RSE) practices.
Key Contributions
- A large‑scale, open dataset of 134 352 unique research‑software projects together with 134 154 source‑code repositories referenced in open‑access papers.
- Rich per‑repository metadata (latest release, license, primary programming language, and presence of descriptive files such as README, CITATION.cff, and CODE_OF_CONDUCT).
- Linkage between publications and software, enabling traceability from a research claim to the exact code version used.
- Statistical overview of the dataset (e.g., language distribution, license popularity) that serves as a baseline for future RSE analyses.
- Open‑source release under a permissive license, encouraging reuse, extension, and community contributions.
Methodology
- Literature Harvesting – The authors mined open‑access articles from major repositories (e.g., arXiv, PubMed Central) for URLs that point to code hosting platforms (GitHub, GitLab, Bitbucket, etc.).
- Deduplication & Normalization – Identical repository URLs appearing in multiple papers were collapsed, yielding a set of unique software projects.
- Metadata Extraction – For each repository, the latest commit was inspected via the hosting platform’s API to collect:
- Current version tag or release name
- SPDX‑compatible license identifier
- Primary programming language (as reported by the platform)
- Presence of common metadata files (README, CITATION.cff, LICENSE, CONTRIBUTING.md).
- Dataset Assembly – Each record stores the citing paper’s DOI, the repository URL, and the extracted metadata. The full collection is released as CSV/JSON files together with a small Python library for easy querying.
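To make the publication‑to‑software linkage concrete, a single record might look roughly like the dictionary below; the field names are illustrative placeholders inferred from the description above, not the dataset's documented schema.

```python
# Illustrative OpenDORS-style record (hypothetical field names, not the real schema).
example_record = {
    "paper_doi": "10.1234/example.5678",          # DOI of the citing paper
    "repository_url": "https://github.com/example-org/example-tool",
    "latest_release": "v2.1.0",                   # latest tag/release, if any
    "license_spdx": "MIT",                        # SPDX-compatible identifier
    "primary_language": "Python",                 # as reported by the platform
    "has_readme": True,
    "has_citation_cff": False,
    "has_license_file": True,
    "has_contributing": False,
}
```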
The pipeline is fully automated, making it straightforward to refresh the dataset as new papers appear.
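As an illustration of the harvesting and metadata‑extraction steps described above, here is a minimal Python sketch. The GitHub REST API endpoints it calls (/repos/{owner}/{repo}, .../releases/latest, .../contents/CITATION.cff) are real, but the regular expression, function names, and sample text are assumptions; this is not the authors' actual pipeline code.

```python
import re
import requests

# Match links to the major code-hosting platforms mentioned in the paper.
HOST_PATTERN = re.compile(
    r"https?://(?:www\.)?(github\.com|gitlab\.com|bitbucket\.org)/"
    r"([\w.\-]+)/([\w.\-]+)",
    re.IGNORECASE,
)

def harvest_repo_urls(article_text: str) -> set[str]:
    """Extract and normalize repository URLs found in an article's full text."""
    urls = set()
    for host, owner, repo in HOST_PATTERN.findall(article_text):
        repo = repo.removesuffix(".git")  # normalize away trailing .git
        urls.add(f"https://{host.lower()}/{owner}/{repo}")
    return urls

def github_metadata(owner: str, repo: str) -> dict:
    """Collect high-level metadata for one GitHub repository via the REST API."""
    base = f"https://api.github.com/repos/{owner}/{repo}"
    info = requests.get(base, timeout=10).json()
    meta = {
        "license_spdx": (info.get("license") or {}).get("spdx_id"),
        "primary_language": info.get("language"),
        "latest_release": None,
        "has_citation_cff": False,
    }
    release = requests.get(f"{base}/releases/latest", timeout=10)
    if release.status_code == 200:
        meta["latest_release"] = release.json().get("tag_name")
    citation = requests.get(f"{base}/contents/CITATION.cff", timeout=10)
    meta["has_citation_cff"] = citation.status_code == 200
    return meta

if __name__ == "__main__":
    text = "Our tool is available at https://github.com/octocat/Hello-World."
    for url in harvest_repo_urls(text):
        owner, repo = url.rsplit("/", 2)[-2:]  # this sketch handles GitHub URLs only
        print(url, github_metadata(owner, repo))
```

A full‑scale harvest would additionally need authenticated API tokens, rate‑limit handling, and equivalent calls for GitLab and Bitbucket.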
Results & Findings
- Coverage – 134 352 distinct software projects are linked to 134 154 source‑code repositories, an almost one‑to‑one correspondence between projects and repositories; a given paper may reference one or several of them.
- Licensing – Over 60 % of the repositories use permissive licenses (MIT, BSD, Apache 2.0), while ~15 % are under GPL‑family licenses; the remainder lack a clear license declaration.
- Language Landscape – Python dominates (≈ 45 % of projects), followed by R, Java, and C/C++. This mirrors the prevalence of data‑science and statistical computing in research.
- Metadata Adoption – Only ~30 % of repositories contain a CITATION.cff file, indicating that formal citation guidance for software is still rare.
- Versioning – Roughly half of the projects have an explicit release tag; the rest rely on the default master/main branch, which can hinder reproducibility.
These descriptive statistics already reveal gaps (e.g., missing licenses, sparse citation files) that RSE scholars can target in future work.
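To show how such descriptive statistics could be recomputed from the released CSV, here is a short pandas sketch; the file name and column names (license_spdx, primary_language, has_citation_cff) are assumed for illustration and should be checked against the actual schema.

```python
import pandas as pd

# Hypothetical file and column names; the released dataset defines the real schema.
df = pd.read_csv("opendors.csv")

# Share of repositories per SPDX license identifier (NaN = no license detected).
license_share = df["license_spdx"].value_counts(normalize=True, dropna=False)

# Share of repositories per primary programming language.
language_share = df["primary_language"].value_counts(normalize=True)

# Fraction of repositories shipping a CITATION.cff file.
citation_rate = df["has_citation_cff"].mean()

print(license_share.head(10))
print(language_share.head(10))
print(f"CITATION.cff present in {citation_rate:.1%} of repositories")
```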
Practical Implications
- Reproducibility Audits – Developers can cross‑check their own projects against the dataset to see whether they meet community norms (license, citation file, versioned releases); a minimal sketch of such a check follows this list.
- Tooling for RSE – The metadata can feed dashboards that alert researchers when a referenced repository lacks a license or proper citation metadata, prompting quick remediation.
- Policy & Funding – Funding agencies can use the dataset to benchmark compliance with open‑science mandates (e.g., mandatory licensing, citation of software).
- Search & Discovery – Platforms like Zenodo or Figshare could integrate OpenDORS to surface the most‑cited research software, helping developers find proven codebases to build upon.
- Machine‑Learning Analyses – The structured data enables large‑scale modeling of software evolution, language adoption trends, or the impact of licensing choices on citation counts.
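As a sketch of the audit and dashboard ideas above, the snippet below joins a list of referenced repository URLs against the dataset and flags entries without a license or a CITATION.cff file; the file name, column names, and function are hypothetical, not a tool shipped with OpenDORS.

```python
import pandas as pd

# Hypothetical file and column names; the released dataset defines the real schema.
dataset = pd.read_csv("opendors.csv")

def flag_noncompliant(repo_urls: list[str]) -> pd.DataFrame:
    """Return referenced repositories that lack a license or a CITATION.cff file."""
    referenced = dataset[dataset["repository_url"].isin(repo_urls)]
    problems = referenced[
        referenced["license_spdx"].isna()      # no SPDX license recorded
        | ~referenced["has_citation_cff"]      # assumed boolean column
    ]
    return problems[["repository_url", "license_spdx", "has_citation_cff"]]

if __name__ == "__main__":
    report = flag_noncompliant(["https://github.com/example-org/example-tool"])
    print(report if not report.empty else "All referenced repositories look compliant.")
```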
In short, OpenDORS turns a scattered set of “software mentions” into a structured, queryable resource that developers, repository maintainers, and research managers can act upon.
Limitations & Future Work
- Open‑Access Bias – The dataset only includes papers that are freely available; software cited in pay‑walled articles is omitted, potentially skewing discipline coverage.
- Repository Scope – Only publicly reachable URLs on major hosting services were captured; self‑hosted or institutional repositories may be missing.
- Static Snapshot – While the pipeline can be rerun, the released version is a snapshot; continuous integration would be needed for truly up‑to‑date analyses.
- Metadata Depth – The current extraction stops at high‑level fields; deeper code‑quality metrics (test coverage, CI status) are left for future extensions.
The authors plan to broaden source coverage, add dynamic quality indicators, and provide a live API so the community can keep the dataset fresh and increasingly actionable.
Authors
- Stephan Druskat
- Lars Grunske
Paper Information
- arXiv ID: 2512.01570v1
- Categories: cs.SE, cs.DL
- Published: December 1, 2025
- PDF: https://arxiv.org/pdf/2512.01570v1