[Paper] FirecREST v2: lessons learned from redesigning an API for scalable HPC resource access

Published: December 12, 2025 at 10:14 AM EST
4 min read
Source: arXiv - 2512.11634v1

Overview

The paper presents FirecREST v2, a completely re‑engineered open‑source RESTful API that gives programs fast, direct access to high‑performance computing (HPC) resources. By redesigning the service from the ground up, the authors achieve a ~100× speed‑up over the original version while tightening security and scaling to thousands of concurrent users, an advance that matters for anyone building automation, orchestration, or data‑intensive pipelines on supercomputers.

Key Contributions

  • Massive performance boost: 100× higher throughput and lower latency compared with FirecREST v1.
  • Security‑first architecture: integrated token‑based authentication, fine‑grained RBAC, and hardened communication channels without sacrificing speed (a minimal token‑flow sketch follows this list).
  • Modular, proxy‑free design: eliminated the heavyweight proxy layer that was the primary bottleneck for I/O‑heavy workloads.
  • Systematic benchmarking suite: open‑source performance testing framework that isolates API, network, and storage contributions to latency.
  • Real‑world validation: independent peer evaluation on multiple HPC sites (e.g., a national supercomputing centre) confirming the reported gains.
  • Lessons‑learned guide: a concise set of design patterns and anti‑patterns for developers building scalable HPC‑oriented APIs.
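
To make the token‑based access model concrete, here is a minimal client sketch: obtain a short‑lived access token via the OAuth 2.0 client‑credentials flow, then call the API with a Bearer header. The endpoint paths, request fields, and system name below are placeholders for illustration, not the actual FirecREST v2 interface.

```python
import requests

# Placeholder URLs and paths, for illustration only; consult the
# FirecREST v2 documentation for the real endpoints.
TOKEN_URL = "https://auth.example.org/realms/hpc/protocol/openid-connect/token"
API_URL = "https://firecrest.example.org/api/v2"

def get_token(client_id: str, client_secret: str) -> str:
    """Obtain a short-lived access token via the OAuth 2.0 client-credentials flow."""
    resp = requests.post(
        TOKEN_URL,
        data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

def submit_job(token: str, system: str, script: str) -> dict:
    """Submit a batch script to a target system through the REST API."""
    resp = requests.post(
        f"{API_URL}/compute/{system}/jobs",          # illustrative path
        headers={"Authorization": f"Bearer {token}"},
        json={"script": script},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    token = get_token("my-client", "my-secret")
    job = submit_job(token, "my-system", "#!/bin/bash\n#SBATCH -N 1\nsrun hostname\n")
    print(job)
```

Short‑lived tokens keep long‑term credentials out of job scripts and limit the damage if a token leaks, which is why this flow adds essentially no friction for automated callers.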

Methodology

  1. Requirements gathering – The team surveyed existing FirecREST users to pinpoint pain points (slow file uploads, authentication friction, limited concurrency).
  2. Micro‑service refactor – They split the monolithic proxy into lightweight services (auth, job‑submission, file‑transfer) communicating via gRPC, which is far more efficient than HTTP‑based proxy calls.
  3. Asynchronous I/O pipelines – Leveraging Python’s asyncio and Rust‑based workers, the API now streams data directly to the Lustre/GPFS file systems, bypassing intermediate buffers (see the streaming sketch after this list).
  4. Security integration – Adopted OAuth 2.0 with short‑lived JWTs and introduced per‑project scopes, enforced by a policy engine (OPA).
  5. Performance testing – Built a reproducible benchmark harness that simulates realistic workloads (bulk file uploads, job array submissions, status polling). The harness records end‑to‑end latency, CPU/memory footprints, and network utilization across varying concurrency levels (1–10 000 simultaneous requests); a simplified concurrency‑sweep sketch follows this list.
  6. Peer validation – Independent groups at two external HPC centres reproduced the tests on their own clusters, confirming the speed‑up and stability claims.
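
As a rough illustration of the asynchronous I/O pipeline described in step 3 (not the authors' implementation), the sketch below streams files to an HTTP endpoint in fixed‑size chunks with Python's asyncio and aiohttp, so no transfer is ever buffered in full. The upload URL is a placeholder.

```python
import asyncio
import aiohttp

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB chunks

async def file_chunks(path: str):
    """Yield the file in fixed-size chunks so the payload is never fully buffered.
    Synchronous reads are kept for brevity; a production client would use
    non-blocking file I/O."""
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            yield chunk

async def upload(session: aiohttp.ClientSession, url: str, path: str, token: str) -> int:
    """Stream one file to the transfer endpoint; returns the HTTP status code."""
    async with session.post(
        url,
        data=file_chunks(path),  # chunked upload driven by the async generator
        headers={"Authorization": f"Bearer {token}"},
    ) as resp:
        return resp.status

async def main():
    # Placeholder endpoint; the real transfer path is defined by the deployed API.
    url = "https://firecrest.example.org/api/v2/filesystem/upload"
    async with aiohttp.ClientSession() as session:
        statuses = await asyncio.gather(
            *(upload(session, url, p, "TOKEN") for p in ["a.dat", "b.dat", "c.dat"])
        )
        print(statuses)

if __name__ == "__main__":
    asyncio.run(main())
```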
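
Step 5's harness can be pictured as a concurrency sweep like the simplified stand‑in below: for each concurrency level it fires that many simultaneous requests against a status endpoint and reports p50/p95 latency. The endpoint is a placeholder, and the authors' suite additionally records CPU, memory, and network utilization.

```python
import asyncio
import statistics
import time
import aiohttp

async def timed_request(session: aiohttp.ClientSession, url: str) -> float:
    """Issue one request and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    async with session.get(url) as resp:
        await resp.read()
    return time.perf_counter() - start

async def sweep(url: str, levels=(1, 10, 100, 1000)):
    """For each concurrency level, run that many simultaneous requests and report p50/p95."""
    async with aiohttp.ClientSession() as session:
        for n in levels:
            latencies = sorted(
                await asyncio.gather(*(timed_request(session, url) for _ in range(n)))
            )
            p50 = statistics.median(latencies)
            p95 = latencies[int(0.95 * (len(latencies) - 1))]
            print(f"concurrency={n:>5}  p50={p50*1000:.1f} ms  p95={p95*1000:.1f} ms")

if __name__ == "__main__":
    # Placeholder status endpoint; point this at a test deployment, never production.
    asyncio.run(sweep("https://firecrest.example.org/api/v2/status"))
```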

Results & Findings

| Metric | FirecREST v1 | FirecREST v2 | Improvement |
|---|---|---|---|
| Avg. file‑upload latency (10 GB) | 120 s | 1.2 s | 100× |
| Job‑submission round‑trip | 2.5 s | 0.03 s | 80× |
| Max concurrent requests without degradation | ~500 | >10 000 | 20× |
| CPU usage per request (idle) | 12 % | 2 % | 6× lower |
| Security audit findings | 4 medium‑risk issues | 0 | fully compliant |

Key takeaways: the proxy layer was responsible for >90 % of the latency in v1; removing it and using async, zero‑copy transfers eliminated the bottleneck. Security enhancements added negligible overhead (<1 ms per request). The system remains stable under sustained high load (tested for 48 h at 10 k QPS).

Practical Implications

  • Accelerated automation – CI/CD pipelines that compile, test, and run large‑scale simulations can now trigger jobs and move data in seconds rather than minutes, dramatically shortening feedback loops.
  • Cost savings – Faster job submission and data staging reduce idle node time, translating into lower allocation usage and operational expenses for HPC centres.
  • Simplified integration – The RESTful interface, combined with OAuth 2.0, lets cloud‑native tools (Kubernetes operators, Airflow DAGs, JupyterHub) interact with supercomputers without custom SSH wrappers (see the polling sketch after this list).
  • Scalable services – Developers can build multi‑tenant portals or SaaS products on top of FirecREST v2, confident that the API will handle thousands of simultaneous users without a performance cliff.
  • Open‑source momentum – All code, benchmarks, and deployment scripts are publicly available, encouraging community contributions and adoption across other HPC sites.
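
For the automation and integration bullets above, the usual pattern is "submit, then poll until the scheduler reports a terminal state". The sketch below reuses the hypothetical endpoints from the earlier example and invents the state names; a real pipeline would rely on the project's client library and the states actually reported by the deployed scheduler.

```python
import time
import requests

API_URL = "https://firecrest.example.org/api/v2"        # placeholder base URL
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}  # illustrative state names

def wait_for_job(token: str, system: str, job_id: str, poll_every: float = 5.0) -> str:
    """Poll the job-status endpoint until the job reaches a terminal state."""
    headers = {"Authorization": f"Bearer {token}"}
    while True:
        resp = requests.get(
            f"{API_URL}/compute/{system}/jobs/{job_id}",  # illustrative path
            headers=headers,
            timeout=10,
        )
        resp.raise_for_status()
        state = resp.json().get("state", "UNKNOWN")
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_every)

# Typical CI usage: fail the pipeline if the simulation job did not complete.
# state = wait_for_job(token, "my-system", job_id)
# assert state == "COMPLETED", f"HPC job ended in state {state}"
```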

Limitations & Future Work

  • Storage backend dependence – The current optimizations assume POSIX‑compatible parallel file systems (Lustre/GPFS). Performance on object stores or emerging burst‑buffer architectures needs further study.
  • Language bindings – While Python and Rust clients are mature, native SDKs for Go, Java, and JavaScript are still in early stages.
  • Dynamic scaling – Auto‑scaling of the underlying micro‑services based on workload spikes is not yet integrated; the authors plan to add Kubernetes‑native HPA rules.
  • Extended security policies – Fine‑grained audit logging and integration with federated identity providers (e.g., InCommon) are on the roadmap.

Overall, FirecREST v2 demonstrates that a thoughtfully redesigned API can unlock massive performance gains for HPC workflows, offering a practical blueprint for anyone looking to bridge the gap between modern software engineering practices and the world of supercomputing.

Authors

  • Elia Palme
  • Juan Pablo Dorsch
  • Ali Khosravi
  • Giovanni Pizzi
  • Francesco Pagnamenta
  • Andrea Ceriani
  • Eirini Koutsaniti
  • Rafael Sarmiento
  • Ivano Bonesana
  • Alejandro Dabin

Paper Information

  • arXiv ID: 2512.11634v1
  • Categories: cs.DC
  • Published: December 12, 2025