[Paper] FirecREST v2: lessons learned from redesigning an API for scalable HPC resource access
Source: arXiv - 2512.11634v1
Overview
The paper presents FirecREST v2, a completely re‑engineered open‑source RESTful API that gives programs direct, high‑performance access to HPC (high‑performance computing) resources. By redesigning the service from the ground up, the authors achieve a ~100× speed‑up over the original version while tightening security and scaling to thousands of concurrent users—an advance that matters for anyone building automation, orchestration, or data‑intensive pipelines on supercomputers.
Key Contributions
- Massive performance boost: 100× higher throughput and lower latency compared with FirecREST v1.
- Security‑first architecture: integrated token‑based authentication, fine‑grained RBAC, and hardened communication channels without sacrificing speed.
- Modular, proxy‑free design: eliminated the heavyweight proxy layer that was the primary bottleneck for I/O‑heavy workloads.
- Systematic benchmarking suite: open‑source performance testing framework that isolates API, network, and storage contributions to latency.
- Real‑world validation: independent peer evaluation on multiple HPC sites (e.g., a national supercomputing centre) confirming the reported gains.
- Lessons‑learned guide: a concise set of design patterns and anti‑patterns for developers building scalable HPC‑oriented APIs.
Methodology
- Requirements gathering – The team surveyed existing FirecREST users to pinpoint pain points (slow file uploads, authentication friction, limited concurrency).
- Micro‑service refactor – They split the monolithic proxy into lightweight services (auth, job‑submission, file‑transfer) communicating via gRPC, which is far more efficient than HTTP‑based proxy calls.
- Asynchronous I/O pipelines – Leveraging Python's asyncio and Rust‑based workers, the API now streams data directly to the Lustre/GPFS file systems, bypassing intermediate buffers.
- Security integration – Adopted OAuth 2.0 with short‑lived JWTs and introduced per‑project scopes, enforced by a policy engine (OPA).
- Performance testing – Built a reproducible benchmark harness that simulates realistic workloads (bulk file uploads, job array submissions, status polling). The harness records end‑to‑end latency, CPU/memory footprints, and network utilization across varying concurrency levels (1–10 000 simultaneous requests); a minimal sketch of such a concurrency sweep follows this list.
- Peer validation – Independent groups at two external HPC centres reproduced the tests on their own clusters, confirming the speed‑up and stability claims.
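The sketch below illustrates the general pattern of such a concurrency sweep: fire N simultaneous requests with asyncio and record end‑to‑end latency per level. It is not the authors' harness; the endpoint URL, token, and payload names are hypothetical placeholders.

```python
# Hypothetical concurrency sweep: illustrates the pattern, not the paper's harness.
import asyncio
import time

import aiohttp  # assumed async HTTP client; any equivalent works

API_URL = "https://hpc.example.org/api/v2/status"  # placeholder endpoint
TOKEN = "…"                                        # short-lived JWT (placeholder)

async def timed_request(session: aiohttp.ClientSession) -> float:
    """Issue one request and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    headers = {"Authorization": f"Bearer {TOKEN}"}
    async with session.get(API_URL, headers=headers) as resp:
        await resp.read()
    return time.perf_counter() - start

async def sweep(concurrency: int) -> None:
    """Fire `concurrency` simultaneous requests and report the mean latency."""
    connector = aiohttp.TCPConnector(limit=0)  # lift the default connection cap
    async with aiohttp.ClientSession(connector=connector) as session:
        latencies = await asyncio.gather(
            *(timed_request(session) for _ in range(concurrency))
        )
    mean = sum(latencies) / len(latencies)
    print(f"{concurrency:>6} concurrent requests: mean latency {mean:.3f} s")

async def main() -> None:
    # Concurrency levels spanning the range reported in the paper (1–10 000).
    for level in (1, 10, 100, 1000, 10_000):
        await sweep(level)

if __name__ == "__main__":
    asyncio.run(main())
```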
Results & Findings
| Metric | FirecREST v1 | FirecREST v2 | Improvement |
|---|---|---|---|
| Avg. file‑upload latency (10 GB) | 120 s | 1.2 s | 100× |
| Job‑submission round‑trip | 2.5 s | 0.03 s | 80× |
| Max concurrent requests without degradation | ~500 | >10 000 | 20× |
| CPU usage per request (idle) | 12 % | 2 % | 6× lower |
| Security audit findings | 4 medium‑risk issues | 0 | fully compliant |
Key takeaways: the proxy layer was responsible for >90 % of the latency in v1; removing it and using async, zero‑copy transfers eliminated the bottleneck. Security enhancements added negligible overhead (<1 ms per request). The system remains stable under sustained high load (tested for 48 h at 10 k QPS).
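To make the "no intermediate buffers" point concrete, here is a simplified illustration of chunked streaming: the incoming body is written to its destination as it arrives instead of being buffered whole by a proxy. The handler, route, path, and port are hypothetical, the file write is kept synchronous for brevity, and the paper's actual implementation relies on Rust workers writing to Lustre/GPFS.

```python
# Simplified illustration only: stream an upload to its destination in chunks
# rather than buffering the full payload in an intermediate proxy layer.
from aiohttp import web

CHUNK = 1 << 20  # 1 MiB read size

async def upload(request: web.Request) -> web.Response:
    """Write the incoming request body directly to the target file as it arrives."""
    target = "/scratch/project/incoming.dat"  # placeholder path
    with open(target, "wb") as out:           # blocking write, kept simple for the sketch
        async for chunk in request.content.iter_chunked(CHUNK):
            out.write(chunk)                  # no whole-file buffer is ever held in memory
    return web.Response(text="stored\n")

app = web.Application()
app.add_routes([web.post("/files/upload", upload)])  # placeholder route

if __name__ == "__main__":
    web.run_app(app, port=8080)
```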
Practical Implications
- Accelerated automation – CI/CD pipelines that compile, test, and run large‑scale simulations can now trigger jobs and move data in seconds rather than minutes, dramatically shortening feedback loops.
- Cost savings – Faster job submission and data staging reduce idle node time, translating into lower allocation usage and operational expenses for HPC centres.
- Simplified integration – The RESTful interface, combined with OAuth 2.0, lets cloud‑native tools (Kubernetes operators, Airflow DAGs, JupyterHub) interact with supercomputers without custom SSH wrappers (see the sketch after this list).
- Scalable services – Developers can build multi‑tenant portals or SaaS products on top of FirecREST v2, confident that the API will handle thousands of simultaneous users without a performance cliff.
- Open‑source momentum – All code, benchmarks, and deployment scripts are publicly available, encouraging community contributions and adoption across other HPC sites.
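As a sketch of what such an integration looks like, the snippet below obtains a short‑lived token via the OAuth 2.0 client‑credentials flow and submits a job over REST. All URLs, endpoint paths, and field names are placeholders rather than the documented FirecREST v2 API.

```python
# Hypothetical client: OAuth 2.0 client-credentials token exchange plus a REST
# job submission. URLs, paths, and JSON fields are placeholders.
import requests

TOKEN_URL = "https://auth.example.org/realms/hpc/protocol/openid-connect/token"
API_BASE = "https://hpc.example.org/api/v2"

def get_token(client_id: str, client_secret: str) -> str:
    """Exchange client credentials for a short-lived JWT access token."""
    resp = requests.post(
        TOKEN_URL,
        data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

def submit_job(token: str, script: str) -> str:
    """POST a batch script to the (placeholder) job-submission endpoint."""
    resp = requests.post(
        f"{API_BASE}/jobs",
        headers={"Authorization": f"Bearer {token}"},
        json={"script": script},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["job_id"]

if __name__ == "__main__":
    token = get_token("my-pipeline", "…")  # credentials are placeholders
    job_id = submit_job(token, "#!/bin/bash\nsrun ./simulate\n")
    print("submitted job", job_id)
```

A CI/CD pipeline or workflow engine can wrap these two calls in a task that refreshes the token as it expires, which is exactly the kind of automation loop the latency improvements shorten.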
Limitations & Future Work
- Storage backend dependence – The current optimizations assume POSIX‑compatible parallel file systems (Lustre/GPFS). Performance on object stores or emerging burst‑buffer architectures needs further study.
- Language bindings – While Python and Rust clients are mature, native SDKs for Go, Java, and JavaScript are still in early stages.
- Dynamic scaling – Auto‑scaling of the underlying micro‑services based on workload spikes is not yet integrated; the authors plan to add Kubernetes‑native HPA rules.
- Extended security policies – Fine‑grained audit logging and integration with federated identity providers (e.g., InCommon) are on the roadmap.
Overall, FirecREST v2 demonstrates that a thoughtfully redesigned API can unlock massive performance gains for HPC workflows, offering a practical blueprint for anyone looking to bridge the gap between modern software engineering practices and the world of supercomputing.
Authors
- Elia Palme
- Juan Pablo Dorsch
- Ali Khosravi
- Giovanni Pizzi
- Francesco Pagnamenta
- Andrea Ceriani
- Eirini Koutsaniti
- Rafael Sarmiento
- Ivano Bonesana
- Alejandro Dabin
Paper Information
- arXiv ID: 2512.11634v1
- Categories: cs.DC
- Published: December 12, 2025