[Paper] A High-Throughput AES-GCM Implementation on GPUs for Secure, Policy-Based Access to Massive Astronomical Catalogs
Source: arXiv - 2602.23067v1
Overview
The paper introduces a GPU‑powered implementation of AES‑GCM that can encrypt and authenticate petabyte‑scale astronomical image catalogs at line rate. By coupling this high‑throughput crypto engine with a flexible policy engine, the authors enable fine‑grained, policy‑based access control during the pre‑publication phase of large sky surveys, where data confidentiality and integrity are critical.
Key Contributions
- Parallel tree‑reduction for GHASH: A novel algorithm that restructures the conventionally sequential GCM authentication hash into a logarithmic‑depth, massively parallel operation suitable for GPUs.
- GPU‑optimized AES‑GCM pipeline: End‑to‑end integration of the parallel GHASH with AES encryption, achieving throughput comparable to raw GPU memory bandwidth.
- Policy‑driven access framework: A lightweight, rule‑based engine that maps user credentials and dataset attributes to encryption keys and access rights, enabling fine‑grained control without sacrificing performance.
- Demonstration on petabyte‑scale workloads: Empirical evaluation on real astronomical image catalogs shows the system can sustain >10 GB/s per GPU, scaling linearly across multiple devices.
- Open‑source reference implementation: The authors release the core kernels and policy engine under a permissive license, facilitating adoption in other scientific domains.
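The tree reduction exploits the fact that GHASH's Horner‑style recurrence can be unrolled into independent terms (here $\oplus$ denotes addition in GF(2¹²⁸), i.e. bitwise XOR):

$$Y_i = (Y_{i-1} \oplus C_i) \cdot H \quad\Longrightarrow\quad Y_n = \bigoplus_{i=1}^{n} C_i \cdot H^{\,n-i+1}$$

Each ciphertext block $C_i$ can therefore be multiplied by its precomputed power of $H$ independently, and because $\oplus$ is associative, the $n$ partial results combine in a binary tree of depth $\lceil \log_2 n \rceil$ rather than a chain of length $n$.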
Methodology
- Problem decomposition: The authors identify GHASH as the bottleneck of AES‑GCM on GPUs because its standard Horner‑style evaluation chains multiplications over GF(2¹²⁸) serially, one ciphertext block at a time.
- Tree‑reduction design: They restructure GHASH as a binary reduction tree where each leaf processes a block of ciphertext, and intermediate nodes combine partial hashes using pre‑computed multiplication tables. This reduces the depth of the computation from O(n) to O(log n).
- GPU kernel engineering:
- Memory layout: Ciphertext blocks are stored in a contiguous buffer to maximize coalesced reads.
- Shared memory tiling: Partial GHASH results are kept in shared memory to avoid global memory traffic during reductions.
- Warp‑level primitives: The implementation leverages warp shuffles for intra‑warp reductions, minimizing synchronization overhead.
- Policy engine integration: A lightweight rule engine (based on JSON‑encoded policies) runs on the host CPU, resolves the appropriate encryption key for each request, and streams data to the GPU pipeline.
- Benchmarking: The system is evaluated on NVIDIA A100 GPUs using synthetic workloads that mimic the size and access patterns of upcoming surveys (e.g., LSST, Euclid). Throughput, latency, and CPU‑GPU overhead are measured and compared against a baseline CPU‑only OpenSSL implementation.
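The paper's kernels are CUDA, but the algebra behind the parallel GHASH can be sketched in plain Python. The sketch below uses a simplified polynomial representation of GF(2¹²⁸) (real GCM additionally applies a bit‑reflected encoding, which this demo omits) and compares the sequential Horner evaluation against the per‑block‑power formulation combined in a logarithmic‑depth XOR tree; it is an illustration of the technique, not the authors' implementation.

```python
# Sketch: sequential vs. tree-reduced GHASH over GF(2^128).
# Simplified polynomial representation; production GCM uses bit-reflected
# encoding per NIST SP 800-38D, which is omitted here for clarity.

def gf_mul(a: int, b: int) -> int:
    """Carry-less multiply of two 128-bit polynomials, reduced modulo
    x^128 + x^7 + x^2 + x + 1 (the GCM field polynomial)."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        b >>= 1
    # Reduce the up-to-255-bit product back to 128 bits:
    # x^(128+k) == x^k * (x^7 + x^2 + x + 1), i.e. 0x87 shifted by k.
    for bit in range(2 * 128 - 2, 127, -1):
        if (p >> bit) & 1:
            p ^= (1 << bit) | (0x87 << (bit - 128))
    return p

def ghash_sequential(h: int, blocks: list[int]) -> int:
    """Standard Horner chain: Y_i = (Y_{i-1} XOR C_i) * H."""
    y = 0
    for c in blocks:
        y = gf_mul(y ^ c, h)
    return y

def ghash_parallel(h: int, blocks: list[int]) -> int:
    """Unrolled form: each block contributes C_i * H^(n-i); the terms
    combine by XOR in a binary tree of logarithmic depth."""
    if not blocks:
        return 0
    n = len(blocks)
    # Precompute H^1..H^n (on a GPU these would be precomputed tables).
    powers = [h]
    for _ in range(n - 1):
        powers.append(gf_mul(powers[-1], h))
    # Independent per-block terms -- this is the embarrassingly parallel part.
    terms = [gf_mul(c, powers[n - 1 - i]) for i, c in enumerate(blocks)]
    # Pairwise XOR reduction tree, depth ceil(log2 n).
    while len(terms) > 1:
        terms = [terms[i] ^ terms[i + 1] if i + 1 < len(terms) else terms[i]
                 for i in range(0, len(terms), 2)]
    return terms[0]
```

Because XOR is associative, the reduction tree produces bit-identical output to the sequential chain while exposing n-way parallelism at the leaves, which is what makes the GPU mapping effective.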
Results & Findings
| Metric | Baseline (CPU‑OpenSSL) | GPU‑AES‑GCM (single A100) | 4‑GPU scaling |
|---|---|---|---|
| Encryption + Auth throughput | ~1.2 GB/s | 10.8 GB/s | ≈ 42 GB/s |
| Latency per 1 GB block | 850 ms | 78 ms | 20 ms (parallel) |
| GHASH speedup (vs. sequential) | 1× | ≈ 12× | ≈ 48× |
| Policy lookup overhead | <1 % of total | <0.5 % of total | <0.5 % of total |
Key takeaways:
- The tree‑reduction cuts GHASH runtime from a linear to a logarithmic profile, eliminating the traditional performance choke point.
- Overall AES‑GCM throughput scales almost linearly with the number of GPUs, confirming that the design does not introduce hidden serialization.
- The policy engine adds negligible overhead, demonstrating that fine‑grained access control can coexist with high‑speed encryption.
Practical Implications
- Astronomy data pipelines: Survey archives can now enforce per‑user or per‑project encryption policies without bottlenecking data ingest or analysis workflows, accelerating the transition from proprietary to FAIR data releases.
- Secure cloud storage: The same GPU‑accelerated AES‑GCM engine can be deployed in cloud environments (e.g., as a microservice behind an object store) to provide high‑throughput, authenticated encryption for large binary assets (satellite imagery, genomics, video archives).
- Edge‑to‑core analytics: Researchers processing data on GPU‑rich clusters (e.g., HPC centers) can keep data encrypted at rest and only decrypt the minimal slices needed for a given analysis, reducing attack surface.
- Policy‑as‑code adoption: By exposing the policy engine via a simple REST/JSON interface, developers can integrate it into CI/CD pipelines, ensuring that any data product shipped to collaborators is automatically wrapped with the correct cryptographic envelope.
- Cost efficiency: Leveraging existing GPU resources (often already present for scientific computing) avoids the need for dedicated hardware security modules, lowering operational expenses while delivering comparable or better performance.
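The host‑side policy lookup described above (JSON‑encoded rules mapping user credentials and dataset attributes to encryption keys) can be sketched in a few lines. The rule fields (`role`, `dataset`, `key_id`) below are hypothetical; the summary does not specify the authors' actual schema.

```python
# Minimal sketch of JSON policy resolution; field names are illustrative,
# not the authors' actual policy schema.
import json
from typing import Optional

POLICIES = json.loads("""
[
  {"role": "survey-member", "dataset": "commissioning", "key_id": "k-embargo-01"},
  {"role": "public",        "dataset": "dr1",           "key_id": "k-public-01"}
]
""")

def resolve_key(role: str, dataset: str) -> Optional[str]:
    """Return the key ID of the first matching rule, or None (access denied)."""
    for rule in POLICIES:
        if rule["role"] == role and rule["dataset"] == dataset:
            return rule["key_id"]
    return None
```

Because the lookup is a simple table scan on the host, it runs concurrently with GPU encryption, consistent with the sub‑1% overhead reported in the results.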
Limitations & Future Work
- GPU dependency: The performance gains hinge on the availability of modern GPUs; legacy systems without CUDA support cannot benefit.
- Key management scope: The paper focuses on symmetric key distribution via the policy engine but does not address full lifecycle management (rotation, revocation) in a distributed setting.
- Side‑channel considerations: While the implementation is constant‑time at the algorithmic level, low‑level GPU side‑channel attacks (e.g., power analysis) are not explored.
- Broader algorithm support: Future work could extend the tree‑reduction technique to other authenticated modes (e.g., ChaCha20‑Poly1305) or to post‑quantum primitives.
- Integration with FAIR metadata services: Linking the policy engine to existing astronomical metadata registries would enable automated policy generation based on dataset provenance and citation requirements.
Authors
- Samuel Lemes-Perera
- Miguel R. Alarcon
- Pino Caballero-Gil
- Miquel Serra-Ricart
Paper Information
- arXiv ID: 2602.23067v1
- Categories: astro-ph.IM, cs.CR, cs.DC
- Published: February 26, 2026