[Paper] Analyzing GitHub Issues and Pull Requests in nf-core Pipelines: Insights into nf-core Pipeline Repositories

Published: January 14, 2026, 11:34 AM EST
4 min read

Source: arXiv - 2601.09612v1

Overview

The paper presents the first large‑scale empirical analysis of how developers and users interact with nf‑core pipelines—community‑curated, reproducible bioinformatics workflows built on Nextflow. By mining more than 25,000 GitHub issues and pull requests (PRs), the authors surface the most common pain points, quantify how quickly they get resolved, and identify which practices actually help close tickets faster.

Key Contributions

  • Dataset & Scope – Collected and cleaned 25,173 GitHub issues and PRs from all public nf‑core pipelines (covering genomics, transcriptomics, proteomics, etc.).
  • Topic Modeling – Applied BERTopic to automatically cluster the textual content, revealing 13 distinct challenge categories (e.g., pipeline development, CI configuration, container debugging).
  • Resolution Dynamics – Quantified that 89 % of issues/PRs are eventually closed, with a median resolution time of ≈3 days.
  • Impact of Metadata – Demonstrated that adding labels (large effect, δ = 0.94) and code snippets (medium effect, δ = 0.50) significantly increases the odds of an issue being resolved.
  • Prioritised Pain Points – Identified tool development & repository maintenance as the toughest hurdles, followed by testing pipelines, CI setup, and debugging containerised workflows.
  • Actionable Recommendations – Provided concrete suggestions for the nf‑core community and other scientific workflow projects to improve usability and sustainability.

Methodology

  1. Data Collection – Used the GitHub REST API to pull every issue and PR from the 30+ nf‑core repositories, filtering out bots and duplicate entries.
  2. Pre‑processing – Normalised text (lower‑casing, removing code fences, stop‑words) and extracted structured fields (labels, timestamps, presence of code snippets).
  3. Topic Extraction – Ran BERTopic, a transformer‑based clustering technique, which first embeds each issue/PR with a Sentence‑BERT model, then groups similar embeddings via HDBSCAN and finally names clusters using class‑based TF‑IDF.
  4. Statistical Analysis – Employed logistic regression and effect‑size calculations (Cohen’s δ) to test how metadata (labels, code snippets, assignee count) correlates with issue closure probability and time‑to‑close.
  5. Qualitative Validation – Randomly sampled 200 items per cluster and manually verified that the generated topics matched the underlying discussion.

The pipeline is deliberately lightweight: developers can replicate it with a few Python packages (requests, pandas, bertopic, scikit‑learn) and a GitHub personal access token.
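
A minimal sketch of that replication path is shown below; the target repository, cleaning rules, and model settings are illustrative assumptions rather than the authors' exact code, and the token is read from a GITHUB_TOKEN environment variable:

```python
import os
import re
import requests
import pandas as pd
from bertopic import BERTopic

TOKEN = os.environ["GITHUB_TOKEN"]            # personal access token
HEADERS = {"Authorization": f"token {TOKEN}"}
REPO = "nf-core/rnaseq"                       # example repository (illustrative choice)

def fetch_issues(repo):
    """Page through every issue and PR of a repository via the GitHub REST API."""
    items, page = [], 1
    while True:
        resp = requests.get(
            f"https://api.github.com/repos/{repo}/issues",
            headers=HEADERS,
            params={"state": "all", "per_page": 100, "page": page},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        items.extend(batch)
        page += 1
    return items

def clean(text):
    """Drop code fences and normalise whitespace/case, mirroring the pre-processing step."""
    text = re.sub(r"```.*?```", " ", text or "", flags=re.DOTALL)
    return re.sub(r"\s+", " ", text).lower().strip()

raw = fetch_issues(REPO)
df = pd.DataFrame(
    {
        "title": [i["title"] for i in raw],
        "body": [clean(i.get("body")) for i in raw],
        "is_pr": ["pull_request" in i for i in raw],   # the issues endpoint returns PRs too
        "closed": [i["state"] == "closed" for i in raw],
    }
)

# Cluster titles + bodies into topics (Sentence-BERT embeddings, HDBSCAN, class-based TF-IDF).
topic_model = BERTopic(min_topic_size=20)
topics, _ = topic_model.fit_transform((df["title"] + " " + df["body"]).tolist())
print(topic_model.get_topic_info().head())
```

From there, per-topic counts and time-to-close statistics can be aggregated with pandas to reproduce the kind of numbers reported in the Results section.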

Results & Findings

  • Closed issues/PRs – 89.38 % of the 25 k items are closed.
  • Median time‑to‑close – 3 days (≈50 % of items are resolved within this window).
  • Effect of labels – Adding at least one label boosts closure odds by a large effect (δ = 0.94).
  • Effect of code snippets – Including a code block raises closure odds by a medium effect (δ = 0.50).
  • Top challenge clusters – (1) tool development & repository maintenance, (2) testing pipelines & CI configuration, (3) debugging containerised workflows.
  • Least problematic – Documentation‑only requests and feature suggestions tend to linger longer and are less likely to be closed quickly.

These numbers suggest that nf‑core’s governance (mandatory CI, peer review) is working: most contributions are triaged and resolved promptly, but certain technical domains still cause friction.
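
As a rough illustration of the metadata analysis behind these effects (the paper's exact model specification is not restated here), the closure regression can be sketched with scikit‑learn; the column names `has_label`, `has_code`, and `closed` are assumed, not the authors':

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def closure_model(df: pd.DataFrame) -> pd.Series:
    """Fit closed ~ has_label + has_code and return odds ratios per flag.

    `df` is expected to hold boolean columns `has_label`, `has_code`,
    and `closed` (illustrative names for the metadata fields).
    """
    X = df[["has_label", "has_code"]].astype(int).values
    y = df["closed"].astype(int).values
    model = LogisticRegression().fit(X, y)
    # exp(coefficient) gives the multiplicative change in closure odds per flag.
    return pd.Series(np.exp(model.coef_[0]), index=["has_label", "has_code"])
```

Odds ratios above 1 indicate that the flag is associated with higher closure odds; as the limitations section notes, this remains a correlational signal rather than proof of causation.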

Practical Implications

  • For Pipeline Authors – Adding clear, descriptive labels (e.g., bug, enhancement, CI) and embedding minimal reproducible code snippets in issue bodies can dramatically speed up triage.
  • For CI Engineers – The high incidence of CI‑related tickets signals a need for more robust, reusable CI templates (e.g., pre‑configured GitHub Actions for Nextflow).
  • For Tool Developers – The “tool development & repo maintenance” cluster highlights that many problems stem from upstream software changes; adopting semantic versioning and automated dependency checks can reduce breakage.
  • For End‑Users – Knowing that most problems are resolved within a few days, users can feel confident filing detailed issues, especially when they include reproducible examples.
  • For Other SWfMS Communities – The methodology (large‑scale issue mining + BERTopic) can be reused to audit the health of any open‑source workflow ecosystem (Snakemake, CWL, Galaxy).
  • Automation Opportunities – Bots that automatically suggest labels or request code snippets when an issue is opened could be integrated into the nf‑core workflow, cutting down manual triage time (a minimal sketch of such a helper follows this list).
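
For example, a small triage helper in that spirit could label open issues that lack a code block and ask the reporter for a reproducible example. It uses only standard GitHub REST endpoints; the repository, label name, and comment wording are placeholders, not nf‑core conventions:

```python
import os
import requests

TOKEN = os.environ["GITHUB_TOKEN"]
HEADERS = {"Authorization": f"token {TOKEN}"}
REPO = "nf-core/rnaseq"  # placeholder target repository

def triage_open_issues(repo):
    """Label unlabeled open issues without a code block and request a snippet."""
    resp = requests.get(
        f"https://api.github.com/repos/{repo}/issues",
        headers=HEADERS,
        params={"state": "open", "per_page": 100},
    )
    resp.raise_for_status()
    for issue in resp.json():
        if "pull_request" in issue:          # skip pull requests
            continue
        body = issue.get("body") or ""
        if "```" not in body and not issue["labels"]:
            number = issue["number"]
            # Apply a placeholder label and leave a gentle prompt for a snippet.
            requests.post(
                f"https://api.github.com/repos/{repo}/issues/{number}/labels",
                headers=HEADERS,
                json={"labels": ["needs-reproducible-example"]},
            ).raise_for_status()
            requests.post(
                f"https://api.github.com/repos/{repo}/issues/{number}/comments",
                headers=HEADERS,
                json={"body": "Could you add a minimal reproducible code snippet? "
                              "Issues with snippets tend to be resolved faster."},
            ).raise_for_status()
```

Run on a schedule (for example from a cron-triggered CI job), such a script keeps routine triage off maintainers while nudging reporters toward exactly the metadata the study links to faster resolution.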

Limitations & Future Work

  • Scope limited to GitHub – Projects hosted elsewhere (GitLab, Bitbucket) are not represented, so the findings may not generalise beyond the highly active, GitHub‑centric nf‑core community.
  • Topic granularity – While BERTopic provides coherent clusters, some nuanced challenges (e.g., specific container runtime bugs) may be merged into broader categories.
  • Causality vs. correlation – The study shows that labels and code snippets correlate with faster resolution, but does not prove they cause it; controlled experiments (e.g., A/B testing label prompts) would be needed.
  • Temporal drift – The dataset spans several years; the nature of challenges may evolve as Nextflow and nf‑core mature. Future work could perform a longitudinal analysis to track shifting pain points.

By addressing these gaps, subsequent research can refine tooling, improve community guidelines, and ultimately make reproducible bioinformatics pipelines even more developer‑friendly.

Authors

  • Khairul Alam
  • Banani Roy

Paper Information

  • arXiv ID: 2601.09612v1
  • Categories: cs.SE
  • Published: January 14, 2026