[Paper] FirecREST v2: lessons learned from redesigning an API for scalable HPC resource access
Source: arXiv - 2512.11634v1
Overview
The paper presents FirecREST v2, a completely re‑engineered open‑source RESTful API that gives programs direct, high‑performance access to HPC (high‑performance computing) resources. By redesigning the service from the ground up, the authors achieve a ~100× speed‑up over the original version while tightening security and scaling to thousands of concurrent users—an advance that matters for anyone building automation, orchestration, or data‑intensive pipelines on supercomputers.
Key Contributions
- Massive performance boost: 100× higher throughput and lower latency compared with FirecREST v1.
- Security‑first architecture: integrated token‑based authentication, fine‑grained RBAC, and hardened communication channels without sacrificing speed.
- Modular, proxy‑free design: eliminated the heavyweight proxy layer that was the primary bottleneck for I/O‑heavy workloads.
- Systematic benchmarking suite: open‑source performance testing framework that isolates API, network, and storage contributions to latency.
- Real‑world validation: independent peer evaluation on multiple HPC sites (e.g., a national supercomputing centre) confirming the reported gains.
- Lessons‑learned guide: a concise set of design patterns and anti‑patterns for developers building scalable HPC‑oriented APIs.
Methodology
- Requirements gathering – The team surveyed existing FirecREST users to pinpoint pain points (slow file uploads, authentication friction, limited concurrency).
- Micro‑service refactor – They split the monolithic proxy into lightweight services (auth, job‑submission, file‑transfer) communicating via gRPC, which is far more efficient than HTTP‑based proxy calls.
- Asynchronous I/O pipelines – Leveraging Python's asyncio and Rust‑based workers, the API now streams data directly to the Lustre/GPFS file systems, bypassing intermediate buffers.
- Security integration – Adopted OAuth 2.0 with short‑lived JWTs and introduced per‑project scopes, enforced by a policy engine (OPA).
- Performance testing – Built a reproducible benchmark harness that simulates realistic workloads (bulk file uploads, job array submissions, status polling). The harness records end‑to‑end latency, CPU/memory footprints, and network utilization across varying concurrency levels (1–10 000 simultaneous requests); a minimal sketch of such a concurrency sweep follows this list.
- Peer validation – Independent groups at two external HPC centres reproduced the tests on their own clusters, confirming the speed‑up and stability claims.
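The sketch below illustrates the general pattern of such a concurrency sweep: fire N simultaneous requests with asyncio and record end‑to‑end latency per level. It is not the authors' harness; the endpoint URL, token, and payload names are hypothetical placeholders.

```python
# Hypothetical concurrency sweep: illustrates the pattern, not the paper's harness.
import asyncio
import time

import aiohttp  # assumed async HTTP client; any equivalent works

API_URL = "https://hpc.example.org/api/v2/status"  # placeholder endpoint
TOKEN = "…"                                        # short-lived JWT (placeholder)

async def timed_request(session: aiohttp.ClientSession) -> float:
    """Issue one request and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    headers = {"Authorization": f"Bearer {TOKEN}"}
    async with session.get(API_URL, headers=headers) as resp:
        await resp.read()
    return time.perf_counter() - start

async def sweep(concurrency: int) -> None:
    """Fire `concurrency` simultaneous requests and report the mean latency."""
    connector = aiohttp.TCPConnector(limit=0)  # lift the default connection cap
    async with aiohttp.ClientSession(connector=connector) as session:
        latencies = await asyncio.gather(
            *(timed_request(session) for _ in range(concurrency))
        )
    mean = sum(latencies) / len(latencies)
    print(f"{concurrency:>6} concurrent requests: mean latency {mean:.3f} s")

async def main() -> None:
    # Concurrency levels spanning the range reported in the paper (1–10 000).
    for level in (1, 10, 100, 1000, 10_000):
        await sweep(level)

if __name__ == "__main__":
    asyncio.run(main())
```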
Results & Findings
| Metric | FirecREST v1 | FirecREST v2 | Improvement |
|---|---|---|---|
| Avg. file‑upload latency (10 GB) | 120 s | 1.2 s | 100× |
| Job‑submission round‑trip | 2.5 s | 0.03 s | 80× |
| Max concurrent requests without degradation | ~500 | >10 000 | 20× |
| CPU usage per request (idle) | 12 % | 2 % | 6× lower |
| Security audit findings | 4 medium‑risk issues | 0 | fully compliant |
Key takeaways: the proxy layer was responsible for >90 % of the latency in v1; removing it and using async, zero‑copy transfers eliminated the bottleneck. Security enhancements added negligible overhead (<1 ms per request). The system remains stable under sustained high load (tested for 48 h at 10 k QPS).
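To make the "no intermediate buffers" point concrete, here is a simplified illustration of chunked streaming: the incoming body is written to its destination as it arrives instead of being buffered whole by a proxy. The handler, route, path, and port are hypothetical, the file write is kept synchronous for brevity, and the paper's actual implementation relies on Rust workers writing to Lustre/GPFS.

```python
# Simplified illustration only: stream an upload to its destination in chunks
# rather than buffering the full payload in an intermediate proxy layer.
from aiohttp import web

CHUNK = 1 << 20  # 1 MiB read size

async def upload(request: web.Request) -> web.Response:
    """Write the incoming request body directly to the target file as it arrives."""
    target = "/scratch/project/incoming.dat"  # placeholder path
    with open(target, "wb") as out:           # blocking write, kept simple for the sketch
        async for chunk in request.content.iter_chunked(CHUNK):
            out.write(chunk)                  # no whole-file buffer is ever held in memory
    return web.Response(text="stored\n")

app = web.Application()
app.add_routes([web.post("/files/upload", upload)])  # placeholder route

if __name__ == "__main__":
    web.run_app(app, port=8080)
```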
Practical Implications
- Accelerated automation – CI/CD pipelines that compile, test, and run large‑scale simulations can now trigger jobs and move data in seconds rather than minutes, dramatically shortening feedback loops.
- Cost savings – Faster job submission and data staging reduce idle node time, translating into lower allocation usage and operational expenses for HPC centres.
- Simplified integration – The RESTful interface, combined with OAuth 2.0, lets cloud‑native tools (Kubernetes operators, Airflow DAGs, JupyterHub) interact with supercomputers without custom SSH wrappers (see the sketch after this list).
- Scalable services – Developers can build multi‑tenant portals or SaaS products on top of FirecREST v2, confident that the API will handle thousands of simultaneous users without a performance cliff.
- Open‑source momentum – All code, benchmarks, and deployment scripts are publicly available, encouraging community contributions and adoption across other HPC sites.
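As a sketch of what such an integration looks like, the snippet below obtains a short‑lived token via the OAuth 2.0 client‑credentials flow and submits a job over REST. All URLs, endpoint paths, and field names are placeholders rather than the documented FirecREST v2 API.

```python
# Hypothetical client: OAuth 2.0 client-credentials token exchange plus a REST
# job submission. URLs, paths, and JSON fields are placeholders.
import requests

TOKEN_URL = "https://auth.example.org/realms/hpc/protocol/openid-connect/token"
API_BASE = "https://hpc.example.org/api/v2"

def get_token(client_id: str, client_secret: str) -> str:
    """Exchange client credentials for a short-lived JWT access token."""
    resp = requests.post(
        TOKEN_URL,
        data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

def submit_job(token: str, script: str) -> str:
    """POST a batch script to the (placeholder) job-submission endpoint."""
    resp = requests.post(
        f"{API_BASE}/jobs",
        headers={"Authorization": f"Bearer {token}"},
        json={"script": script},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["job_id"]

if __name__ == "__main__":
    token = get_token("my-pipeline", "…")  # credentials are placeholders
    job_id = submit_job(token, "#!/bin/bash\nsrun ./simulate\n")
    print("submitted job", job_id)
```

A CI/CD pipeline or workflow engine can wrap these two calls in a task that refreshes the token as it expires, which is exactly the kind of automation loop the latency improvements shorten.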
Limitations & Future Work
- Storage backend dependence – The current optimizations assume POSIX‑compatible parallel file systems (Lustre/GPFS). Performance on object stores or emerging burst‑buffer architectures needs further study.
- Language bindings – While Python and Rust clients are mature, native SDKs for Go, Java, and JavaScript are still in early stages.
- Dynamic scaling – Auto‑scaling of the underlying micro‑services based on workload spikes is not yet integrated; the authors plan to add Kubernetes‑native HPA rules.
- Extended security policies – Fine‑grained audit logging and integration with federated identity providers (e.g., InCommon) are on the roadmap.
Overall, FirecREST v2 demonstrates that a thoughtfully redesigned API can unlock massive performance gains for HPC workflows, offering a practical blueprint for anyone looking to bridge the gap between modern software engineering practices and the world of supercomputing.
Authors
- Elia Palme
- Juan Pablo Dorsch
- Ali Khosravi
- Giovanni Pizzi
- Francesco Pagnamenta
- Andrea Ceriani
- Eirini Koutsaniti
- Rafael Sarmiento
- Ivano Bonesana
- Alejandro Dabin
Paper Information
- arXiv ID: 2512.11634v1
- Categories: cs.DC
- Published: December 12, 2025