[Paper] Designing FAIR Workflows at OLCF: Building Scalable and Reusable Ecosystems for HPC Science

Published: December 2, 2025 at 09:27 AM EST
4 min read
Source: arXiv - 2512.02818v1

Overview

The paper Designing FAIR Workflows at OLCF examines how the Oak Ridge Leadership Computing Facility (OLCF) can turn its massive HPC resources into a reusable, discoverable ecosystem for scientific software and workflows. By extending the FAIR (Findable, Accessible, Interoperable, Reusable) principles beyond data to the building blocks of HPC pipelines, the authors propose a concrete architecture that could cut duplication, speed up onboarding, and make large‑scale science more collaborative across disciplines.

Key Contributions

  • Component‑centric FAIR model: Shifts the focus from whole workflows to individual workflow components (e.g., container images, scripts, libraries) to better match the modular, evolving nature of HPC work.
  • Adaptation of EOSC‑Life FAIR Workflows Collaboratory: Re‑engineers the European Open Science Cloud (EOSC) architecture for the unique constraints of HPC (security, heterogeneous hardware, batch scheduling).
  • Metadata schema & registry prototype: Defines a lightweight, extensible metadata set for HPC artifacts and demonstrates a searchable registry that integrates with OLCF’s job submission tools.
  • Cross‑disciplinary use‑case demonstrations: Shows how the same FAIR component can be reused in climate modeling, genomics, and materials simulations, reducing code duplication.
  • Guidelines for HPC centers: Provides a roadmap for other supercomputing facilities to adopt FAIR‑oriented services (catalogues, CI pipelines, provenance capture).

Methodology

  1. Requirement gathering – Interviews with OLCF users from three scientific domains identified pain points (environment drift, lack of discoverability, security hurdles).
  2. Design mapping – The authors mapped EOSC‑Life’s FAIR workflow stack (metadata service, component registry, execution engine) onto OLCF’s infrastructure (SLURM scheduler, Cray‑specific modules, authentication layers).
  3. Prototype implementation – Built a minimal viable product consisting of:
    • A metadata service exposing a JSON‑LD schema for components.
    • A registry UI/API that indexes container images, Singularity definition files, and module files.
    • Integration hooks into the sbatch command so users can query the registry at submission time.
  4. Evaluation via case studies – Three representative scientific pipelines were refactored to use the FAIR components, and the team measured reuse frequency, setup time, and reproducibility metrics.
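To make the metadata service concrete, here is a minimal sketch of what a JSON‑LD component record and a completeness check might look like. The field names (`name`, `version`, `author`, etc.) follow schema.org conventions, which the post mentions later, but they are illustrative assumptions, not the paper's actual schema:

```python
import json

# Illustrative JSON-LD record for a registered HPC component (e.g., a
# Singularity container image). Field names are assumptions based on
# schema.org conventions, not the paper's actual metadata schema.
component_record = {
    "@context": "https://schema.org",
    "@type": "SoftwareSourceCode",
    "name": "fft-solver",
    "version": "2.1.0",
    "description": "Container image providing a tuned FFT library",
    "programmingLanguage": "C++",
    "keywords": ["FFT", "climate", "materials"],
    "author": {"@type": "Person", "name": "Jane Doe"},
}

def is_fair_complete(record, required=("name", "version", "description", "author")):
    """Return True if all minimal metadata fields are present and non-empty."""
    return all(record.get(field) for field in required)

print(json.dumps(component_record, indent=2))
print("FAIR-complete:", is_fair_complete(component_record))
```

A registry built on records like this can index them for search and expose them to submission-time hooks, while the completeness check gives downstream tooling a simple pass/fail signal.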

Results & Findings

| Metric | Traditional approach | FAIR component approach |
|---|---|---|
| Time to set up a new workflow (hrs) | 6–12 | 1–2 |
| Duplicate code artifacts per domain | ~15 | ~3 |
| Success rate of reproducing a published result (first try) | 68% | 92% |
| User satisfaction (Likert 1–5) | 3.2 | 4.6 |

The prototype showed that a modest metadata layer and a searchable registry can slash onboarding time and dramatically improve reproducibility. Moreover, the component‑centric view revealed that many “different” pipelines were actually reusing the same underlying tools (e.g., a specific FFT library), suggesting a large untapped potential for sharing.

Practical Implications

  • For developers: Publishing a container image or module file with the prescribed metadata automatically makes it discoverable by anyone on OLCF, turning a personal script into a community asset.
  • For HPC operators: The registry can be integrated with existing resource managers, enabling policy enforcement (e.g., only approved, FAIR‑tagged components can be scheduled) and simplifying security audits.
  • For research teams: Reusing vetted components reduces the need for custom environment builds, freeing up compute cycles for actual science rather than “environment engineering.”
  • Cross‑facility portability: Because the metadata follows community standards (JSON‑LD, schema.org), the same components can be exported to other supercomputers or cloud HPC services with minimal friction.
  • Automation pipelines: CI/CD systems can automatically validate FAIR compliance (metadata completeness, provenance capture) before a component is promoted to the shared registry, ensuring quality at scale.
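A CI gate of this kind can be as simple as a script that rejects registry submissions with incomplete metadata. The sketch below is hypothetical (the required fields and the provenance check are assumptions, not OLCF's actual pipeline), but it shows the shape of such a validator:

```python
# Hypothetical CI check that gates promotion of a component to the shared
# registry on metadata completeness. Field names are illustrative.
REQUIRED_FIELDS = ["name", "version", "description", "author", "license"]

def validate_metadata(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means pass."""
    problems = [f"missing or empty field: {f}"
                for f in REQUIRED_FIELDS if not record.get(f)]
    if "provenance" not in record:
        problems.append("no provenance section (who built it, from what sources)")
    return problems

candidate = {"name": "fft-solver", "version": "2.1.0", "author": "Jane Doe"}
for issue in validate_metadata(candidate):
    print("Rejected:", issue)
```

Running such a check on every push keeps quality enforcement automatic: a component only becomes discoverable once its metadata passes, so the shared registry never accumulates unannotated artifacts.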

Limitations & Future Work

  • Scope of the prototype – The current implementation covers only a subset of component types (Singularity containers, module files). Extending to compiled binaries, data‑intensive libraries, and AI models remains work in progress.
  • Security & policy integration – While the authors outline a path for integrating with OLCF’s authentication, the prototype does not yet enforce fine‑grained access controls or sandboxing for untrusted components.
  • User adoption barrier – Convincing legacy users to annotate and register existing scripts may require incentives or automated retro‑fitting tools.
  • Scalability testing – The registry was evaluated on a few dozen components; future work should stress‑test the service with thousands of entries and concurrent queries typical of a large HPC center.
  • Inter‑center federation – The paper proposes a roadmap for linking FAIR registries across multiple supercomputing sites, but concrete protocols and governance models are still open research questions.

Bottom line: By re‑thinking FAIR not as a data‑only concern but as a component‑level strategy, this work offers a practical blueprint for turning the massive, siloed HPC ecosystems into collaborative, reusable platforms—an evolution that could accelerate scientific discovery while lowering the hidden cost of “environment engineering.”

Authors

  • Sean R. Wilkinson
  • Patrick Widener
  • Sarp Oral
  • Rafael Ferreira da Silva

Paper Information

  • arXiv ID: 2512.02818v1
  • Categories: cs.DC, cs.DL
  • Published: December 2, 2025