[Paper] Analyzing GitHub Issues and Pull Requests in nf-core Pipelines: Insights into nf-core Pipeline Repositories
Source: arXiv - 2601.09612v1
Overview
The paper presents the first large‑scale empirical analysis of how developers and users interact with nf‑core pipelines, the community‑curated, reproducible bioinformatics workflows built on Nextflow. By mining more than 25,000 GitHub issues and pull requests (PRs), the authors surface the most common pain points, measure how quickly they are resolved, and identify which practices actually help close tickets faster.
Key Contributions
- Dataset & Scope – Collected and cleaned 25,173 GitHub issues and PRs from all public nf‑core pipelines (covering genomics, transcriptomics, proteomics, etc.).
- Topic Modeling – Applied BERTopic to automatically cluster the textual content, revealing 13 distinct challenge categories (e.g., pipeline development, CI configuration, container debugging).
- Resolution Dynamics – Quantified that 89 % of issues/PRs are eventually closed, with a median resolution time of ≈3 days.
- Impact of Metadata – Demonstrated that adding labels (large effect, d = 0.94) and code snippets (medium effect, d = 0.50) significantly increases the odds of an issue being resolved.
- Prioritised Pain Points – Identified tool development & repository maintenance as the toughest hurdles, followed by testing pipelines, CI setup, and debugging containerised workflows.
- Actionable Recommendations – Provided concrete suggestions for the nf‑core community and other scientific workflow projects to improve usability and sustainability.
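The resolution‑dynamics numbers above reduce to a closure rate and a median time‑to‑close over the cleaned dataset. A minimal sketch of that computation, using fabricated rows (field names mirror the GitHub API; the values are illustrative, not the paper's data):

```python
from datetime import datetime
from statistics import median

# Fabricated sample rows; field names follow the GitHub API payload,
# the values are purely illustrative.
issues = [
    {"created_at": "2025-01-01T00:00:00Z", "closed_at": "2025-01-03T00:00:00Z"},
    {"created_at": "2025-01-02T00:00:00Z", "closed_at": "2025-01-05T00:00:00Z"},
    {"created_at": "2025-01-04T00:00:00Z", "closed_at": None},  # still open
]

def parse(ts: str) -> datetime:
    """Parse GitHub's ISO-8601 timestamps (UTC, 'Z' suffix)."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

closed = [i for i in issues if i["closed_at"]]
closure_rate = len(closed) / len(issues)
days_to_close = [
    (parse(i["closed_at"]) - parse(i["created_at"])).days for i in closed
]
median_days = median(days_to_close)
```

Run over the full 25,173-item dataset, this kind of aggregation yields the paper's headline figures (89 % closure, median ≈3 days).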
Methodology
- Data Collection – Used the GitHub REST API to pull every issue and PR from the 30+ nf‑core repositories, filtering out bots and duplicate entries.
- Pre‑processing – Normalised text (lower‑casing, removing code fences, stop‑words) and extracted structured fields (labels, timestamps, presence of code snippets).
- Topic Extraction – Ran BERTopic, a transformer‑based clustering technique, which first embeds each issue/PR with a Sentence‑BERT model, then groups similar embeddings via HDBSCAN and finally names clusters using class‑based TF‑IDF.
- Statistical Analysis – Employed logistic regression and effect‑size calculations (Cohen's d) to test how metadata (labels, code snippets, assignee count) correlates with issue‑closure probability and time‑to‑close.
- Qualitative Validation – Randomly sampled 200 items per cluster and manually verified that the generated topics matched the underlying discussion.
The pipeline is deliberately lightweight: developers can replicate it with a few Python packages (requests, pandas, bertopic, scikit‑learn) and a GitHub personal access token.
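The pre‑processing and metadata‑extraction steps can be sketched in a few lines of Python. The helper names and the tiny stop‑word set below are illustrative assumptions, not the authors' exact code:

```python
import re

# Fenced code blocks are stripped before topic modelling.
CODE_FENCE = re.compile(r"```.*?```", re.DOTALL)
# Tiny illustrative stop-word set; a real run would use a full list.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in"}

def normalise(body: str) -> str:
    """Lower-case an issue/PR body, drop fenced code blocks, remove stop-words."""
    text = CODE_FENCE.sub(" ", body).lower()
    tokens = [t for t in re.findall(r"[a-z0-9_']+", text) if t not in STOPWORDS]
    return " ".join(tokens)

def extract_fields(issue: dict) -> dict:
    """Pull the structured metadata the study analyses from a GitHub API payload."""
    body = issue.get("body") or ""
    return {
        "labels": [lbl["name"] for lbl in issue.get("labels", [])],
        "created_at": issue.get("created_at"),
        "closed_at": issue.get("closed_at"),
        "has_code_snippet": "```" in body,
        "clean_text": normalise(body),
    }
```

The cleaned `clean_text` field would then feed BERTopic, while the boolean and label fields feed the statistical analysis.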
Results & Findings
| Metric | Observation |
|---|---|
| Closed issues/PRs | 89.38 % of the 25,173 items are closed. |
| Median time‑to‑close | 3 days (≈50 % resolved within this window). |
| Effect of labels | Adding at least one label boosts closure odds by a large effect (d = 0.94). |
| Effect of code snippets | Including a code block raises closure odds by a medium effect (d = 0.50). |
| Top challenge clusters | (1) Tool development & repo maintenance; (2) testing pipelines & CI configuration; (3) debugging containerised workflows. |
| Slowest to close | Documentation‑only requests and feature suggestions tend to linger longer and are less likely to be closed quickly. |
These numbers suggest that nf‑core’s governance (mandatory CI, peer review) is working: most contributions are triaged and resolved promptly, but certain technical domains still cause friction.
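The reported effect sizes are standardised mean differences. A minimal sketch of Cohen's d on fabricated time‑to‑close numbers (not the paper's data) shows how such a comparison works:

```python
from statistics import mean, stdev

def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Cohen's d: standardised mean difference using the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * stdev(group_a) ** 2 + (nb - 1) * stdev(group_b) ** 2
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

# Toy illustration (fabricated numbers, not the study's dataset):
# days-to-close for unlabelled vs labelled issues.
unlabelled = [2, 5, 9, 14, 30, 45]
labelled = [1, 1, 2, 3, 4, 6]
d = cohens_d(unlabelled, labelled)
```

Under the usual conventions (|d| ≈ 0.2 small, 0.5 medium, 0.8 large), values like the paper's 0.94 for labels indicate a substantial difference between the two groups.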
Practical Implications
- For Pipeline Authors – Adding clear, descriptive labels (e.g., `bug`, `enhancement`, `CI`) and embedding minimal reproducible code snippets in issue bodies can dramatically speed up triage.
- For CI Engineers – The high incidence of CI‑related tickets signals a need for more robust, reusable CI templates (e.g., pre‑configured GitHub Actions for Nextflow).
- For Tool Developers – The “tool development & repo maintenance” cluster highlights that many problems stem from upstream software changes; adopting semantic versioning and automated dependency checks can reduce breakage.
- For End‑Users – Knowing that most problems are resolved within a few days, users can feel confident filing detailed issues, especially when they include reproducible examples.
- For Other SWfMS Communities – The methodology (large‑scale issue mining + BERTopic) can be reused to audit the health of any open‑source workflow ecosystem (Snakemake, CWL, Galaxy).
- Automation Opportunities – Bots that automatically suggest labels or request code snippets when an issue is opened could be integrated into the nf‑core workflow, cutting down manual triage time.
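The label‑suggestion bot idea could start as simple keyword matching. A hypothetical sketch (the keyword map is invented for illustration and is not nf‑core's actual tooling):

```python
# Illustrative keyword map; a real deployment would tune this
# against historical, correctly labelled issues.
LABEL_KEYWORDS = {
    "bug": ("error", "fails", "crash", "traceback", "exit status"),
    "CI": ("github actions", "workflow run", "lint failure"),
    "enhancement": ("feature", "would be nice", "support for"),
    "container": ("docker", "singularity", "apptainer", "image pull"),
}

def suggest_labels(title: str, body: str) -> list[str]:
    """Suggest labels by scanning an issue's title and body for trigger keywords."""
    text = f"{title}\n{body}".lower()
    return sorted(
        label
        for label, keywords in LABEL_KEYWORDS.items()
        if any(kw in text for kw in keywords)
    )
```

Wired into an issue‑opened webhook, such a bot could pre‑populate labels and prompt for a reproducible snippet, targeting exactly the metadata the study found to speed up resolution.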
Limitations & Future Work
- Scope limited to GitHub – Projects hosted elsewhere (GitLab, Bitbucket) are not represented, possibly biasing the findings toward the more active nf‑core community.
- Topic granularity – While BERTopic provides coherent clusters, some nuanced challenges (e.g., specific container runtime bugs) may be merged into broader categories.
- Causality vs. correlation – The study shows that labels and code snippets correlate with faster resolution, but does not prove they cause it; controlled experiments (e.g., A/B testing label prompts) would be needed.
- Temporal drift – The dataset spans several years; the nature of challenges may evolve as Nextflow and nf‑core mature. Future work could perform a longitudinal analysis to track shifting pain points.
By addressing these gaps, subsequent research can refine tooling, improve community guidelines, and ultimately make reproducible bioinformatics pipelines even more developer‑friendly.
Authors
- Khairul Alam
- Banani Roy
Paper Information
- arXiv ID: 2601.09612v1
- Categories: cs.SE
- Published: January 14, 2026