[Paper] OpenDORS: A dataset of openly referenced open research software

Published: December 1, 2025 at 06:45 AM EST
Source: arXiv

Overview

The paper introduces OpenDORS, a large, open dataset that links more than 134 000 research software projects to the open-access scholarly articles that reference them. By aggregating repository metadata (licenses, languages, version information) at scale, the authors give the community a concrete foundation for quantitative studies of research software engineering (RSE) practices.

Key Contributions

  • A large‑scale, open dataset of 134 352 unique research‑software projects together with 134 154 source‑code repositories referenced in open‑access papers.
  • Rich per‑repository metadata (latest release, license, primary programming language, and presence of descriptive files such as README, CITATION.cff, CODE_OF_CONDUCT).
  • Linkage between publications and software, enabling traceability from a research claim to the exact code version used.
  • Statistical overview of the dataset (e.g., language distribution, license popularity) that serves as a baseline for future RSE analyses.
  • Open‑source release under a permissive license, encouraging reuse, extension, and community contributions.

Methodology

  1. Literature Harvesting – The authors mined open‑access articles from major repositories (e.g., arXiv, PubMed Central) for URLs that point to code hosting platforms (GitHub, GitLab, Bitbucket, etc.).
  2. Deduplication & Normalization – Identical repository URLs appearing in multiple papers were collapsed, yielding a set of unique software projects.
  3. Metadata Extraction – For each repository, the latest commit was inspected via the hosting platform’s API to collect:
    • Current version tag or release name
    • SPDX‑compatible license identifier
    • Primary programming language (as reported by the platform)
    • Presence of common metadata files (README, CITATION.cff, LICENSE, CONTRIBUTING.md).
  4. Dataset Assembly – Each record stores the citing paper’s DOI, the repository URL, and the extracted metadata. The full collection is released as CSV/JSON files together with a small Python library for easy querying.
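
Steps 1 and 2 can be illustrated with a minimal sketch. It is not the authors' pipeline: the host list, the URL regular expression, the lowercase `host/owner/repo` key, and the function names `normalize_repo_url` and `harvest` are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

# Assumed set of code hosts; the paper's actual list may be longer.
CODE_HOSTS = {"github.com", "gitlab.com", "bitbucket.org"}

# Rough URL matcher for plain text (an assumption, not the authors' pattern).
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\[\]]+")


def normalize_repo_url(url):
    """Reduce a URL to 'host/owner/repo', or return None if it is not a repository URL."""
    parts = urlsplit(url.strip().rstrip(".,;"))
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host not in CODE_HOSTS:
        return None
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2:
        return None  # a user or organisation page, not a repository
    # Simplification: only the first two path segments are kept, so nested
    # GitLab subgroups would be truncated by this sketch.
    owner, repo = segments[0], segments[1]
    if repo.endswith(".git"):
        repo = repo[:-4]
    return f"{host}/{owner.lower()}/{repo.lower()}"


def harvest(papers):
    """Map each normalized repository key to the sorted DOIs of the papers mentioning it."""
    repos = {}
    for doi, text in papers.items():
        for match in URL_PATTERN.findall(text):
            key = normalize_repo_url(match)
            if key is not None:
                repos.setdefault(key, set()).add(doi)
    return {key: sorted(dois) for key, dois in repos.items()}
```

Keying on a normalized string rather than the raw URL is what lets links such as `https://github.com/Foo/Bar.git` and `https://www.github.com/foo/bar/tree/main` collapse into a single project.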

The pipeline is fully automated, making it straightforward to refresh the dataset as new papers appear.
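
As a sketch of the metadata-extraction step, the function below turns already-fetched API responses into one metadata record. It assumes payloads shaped like GitHub's REST responses for a repository (`language`, `license.spdx_id`) and for its latest release (`tag_name`). The function name, the record layout, and the list of tracked files are hypothetical, and other hosting platforms expose different fields.

```python
# Files whose presence is recorded (taken from the list in step 3).
TRACKED_FILES = ("README", "CITATION.cff", "LICENSE", "CONTRIBUTING.md")


def extract_metadata(repo_payload, release_payload=None, filenames=()):
    """Build a metadata record from already-fetched API responses.

    repo_payload    -- dict shaped like GitHub's "get a repository" response
    release_payload -- dict shaped like GitHub's "latest release" response, or None
    filenames       -- names of the files in the repository root
    """
    license_info = repo_payload.get("license") or {}
    spdx = license_info.get("spdx_id")
    if spdx == "NOASSERTION":
        # GitHub reports this when a license file exists but is not recognized.
        spdx = None

    # Compare on the name without extension so that README.md counts as README.
    stems = {name.rsplit(".", 1)[0].upper() for name in filenames}
    exact = set(filenames)
    found = [
        tracked for tracked in TRACKED_FILES
        if tracked in exact or tracked.upper() in stems
    ]

    return {
        "license": spdx,
        "language": repo_payload.get("language"),
        "latest_release": (release_payload or {}).get("tag_name"),
        "files": found,
    }
```

Keeping the network calls outside this function makes the record-building logic easy to test and to reuse across hosting platforms.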

Results & Findings

  • Coverage – The dataset contains 134 352 distinct software projects and 134 154 source‑code repositories. The two counts differ by only 198, so projects and repositories correspond almost one‑to‑one in aggregate.
  • Licensing – Over 60 % of the repositories use permissive licenses (MIT, BSD, Apache 2.0), while ~15 % are under GPL‑family licenses; the remainder lack a clear license declaration.
  • Language Landscape – Python dominates (≈ 45 % of projects), followed by R, Java, and C/C++. This mirrors the prevalence of data‑science and statistical computing in research.
  • Metadata Adoption – Only ~30 % of repositories contain a CITATION.cff file, indicating that formal citation guidance for software is still rare.
  • Versioning – Roughly half of the projects have an explicit release tag; the rest rely on the default master/main branch, which can hinder reproducibility.

These descriptive statistics already reveal gaps (e.g., missing licenses, sparse citation files) that RSE scholars can target in future work.
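
Shares like the ones above could be recomputed from the released records along the following lines. This is a sketch under assumptions: the field names (`license`, `language`, `files`, `latest_release`) and the set of identifiers counted as permissive are hypothetical and may not match the dataset's actual schema.

```python
from collections import Counter

# SPDX identifiers treated as permissive in this sketch (an assumption).
PERMISSIVE = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0"}


def share(records, predicate):
    """Fraction of records for which predicate(record) is true."""
    if not records:
        return 0.0
    return sum(1 for r in records if predicate(r)) / len(records)


def summarize(records):
    """Compute a few descriptive statistics over a list of metadata records."""
    languages = Counter(r["language"] for r in records if r.get("language"))
    return {
        "permissive_license": share(records, lambda r: r.get("license") in PERMISSIVE),
        "no_license": share(records, lambda r: not r.get("license")),
        "has_citation_cff": share(records, lambda r: "CITATION.cff" in r.get("files", [])),
        "has_release": share(records, lambda r: bool(r.get("latest_release"))),
        "top_languages": languages.most_common(3),
    }
```

Each statistic is a predicate over a single record, so further indicators (for example, presence of a CONTRIBUTING.md file) can be added as one more line in `summarize`.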

Practical Implications

  • Reproducibility Audits – Developers can cross‑check their own projects against the dataset to see whether they meet community norms (license, citation file, versioned releases).
  • Tooling for RSE – The metadata can feed dashboards that alert researchers when a referenced repository lacks a license or proper citation metadata, prompting quick remediation.
  • Policy & Funding – Funding agencies can use the dataset to benchmark compliance with open‑science mandates (e.g., mandatory licensing, citation of software).
  • Search & Discovery – Platforms like Zenodo or Figshare could integrate OpenDORS to surface the most‑cited research software, helping developers find proven codebases to build upon.
  • Machine‑Learning Analyses – The structured data enables large‑scale modeling of software evolution, language adoption trends, or the impact of licensing choices on citation counts.

In short, OpenDORS turns a scattered set of “software mentions” into a searchable knowledge graph that developers, repository maintainers, and research managers can act upon.
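
A self-audit of the kind described under Reproducibility Audits might start with a check like the one below, run against a local checkout. The checklist and the accepted file names are hypothetical choices for illustration, not norms defined by the paper.

```python
from pathlib import Path

# Hypothetical checklist: item name -> file names accepted as evidence.
CHECKS = {
    "README": ("README", "README.md", "README.rst", "README.txt"),
    "license": ("LICENSE", "LICENSE.md", "LICENSE.txt", "COPYING"),
    "citation file": ("CITATION.cff",),
}


def audit_checkout(root):
    """Return the checklist items with no matching file in the top level of root."""
    present = {p.name.lower() for p in Path(root).iterdir() if p.is_file()}
    missing = []
    for item, candidates in CHECKS.items():
        if not any(name.lower() in present for name in candidates):
            missing.append(item)
    return missing
```

Calling `audit_checkout(".")` from a repository root returns a list such as `["citation file"]` when only the CITATION.cff file is absent, and an empty list when every item is covered.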

Limitations & Future Work

  • Open‑Access Bias – The dataset only includes papers that are freely available; software cited in pay‑walled articles is omitted, potentially skewing discipline coverage.
  • Repository Scope – Only publicly reachable URLs on major hosting services were captured; self‑hosted or institutional repositories may be missing.
  • Static Snapshot – While the pipeline can be rerun, the released version is a snapshot; regular automated re‑runs would be needed for truly up‑to‑date analyses.
  • Metadata Depth – The current extraction stops at high‑level fields; deeper code‑quality metrics (test coverage, CI status) are left for future extensions.

The authors plan to broaden source coverage, add dynamic quality indicators, and provide a live API so the community can keep the dataset fresh and increasingly actionable.

Authors

  • Stephan Druskat
  • Lars Grunske

Paper Information

  • arXiv ID: 2512.01570v1
  • Categories: cs.SE, cs.DL
  • Published: December 1, 2025
