[Paper] Offloading Artificial Intelligence Workloads across the Computing Continuum by means of Active Storage Systems

Published: December 2, 2025 at 06:04 AM EST
4 min read
Source: arXiv - 2512.02646v1

Overview

The paper investigates how active storage systems—storage devices that can run code—can be used to spread AI training and inference tasks across the whole computing continuum (edge, fog, and cloud). By moving parts of the workload directly to where the data lives, the authors show measurable gains in memory usage, training speed, and overall resource efficiency, while keeping the barrier to entry low for data scientists.

Key Contributions

  • Continuum‑aware software architecture that orchestrates AI workload placement across heterogeneous devices (edge, fog, cloud).
  • Integration of active storage (dataClay) with popular Python AI libraries (e.g., PyTorch, TensorFlow) to enable “compute‑in‑storage” without rewriting models (a minimal sketch of this idea follows the list).
  • Comprehensive evaluation of memory footprint, storage overhead, training time, and accuracy on a set of representative AI tasks (image classification, time‑series forecasting).
  • Open‑source prototype that demonstrates a practical, low‑effort path for developers to offload parts of their pipelines to storage nodes.
  • Trade‑off analysis that quantifies when active‑storage offloading is beneficial versus when traditional cloud execution remains preferable.
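
A minimal sketch of the compute‑in‑storage idea referenced above: a dataset shard exposed as an active‑storage object whose methods execute next to the data. The `DataClayObject` and `activemethod` names follow the dataClay 3.x Python API, but the class, its fields, and the method bodies are illustrative assumptions rather than code from the paper.

```python
# Illustrative active-storage object; assumes the dataClay 3.x Python API
# (DataClayObject + @activemethod). Fields and method bodies are hypothetical.
import torch
from dataclay import DataClayObject, activemethod


class ShardedDataset(DataClayObject):
    """A dataset shard that lives (and computes) inside the storage node."""

    samples: torch.Tensor
    labels: torch.Tensor

    @activemethod
    def normalize(self, mean: float, std: float) -> None:
        # Executes in the storage backend, next to the data.
        self.samples = (self.samples - mean) / std

    @activemethod
    def get_batch(self, start: int, size: int):
        # Only the requested mini-batch crosses the network to the trainer.
        return self.samples[start:start + size], self.labels[start:start + size]
```

A training script on a compute node would then call `get_batch` through the object's proxy instead of loading the whole shard into local RAM.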

Methodology

  1. Design of a middleware layer – a thin Python wrapper that intercepts data‑access calls and decides, based on policy (e.g., data size, device capability), whether to execute a computation locally, on a nearby storage node, or in the cloud.
  2. Active storage platform (dataClay) – the authors extend dataClay with custom “service objects” that expose AI primitives (tensor ops, mini‑batch training loops) as remote callable methods.
  3. Benchmark suite – they pick three common AI workloads (ResNet‑18 on CIFAR‑10, LSTM on a synthetic sensor stream, and a small GNN) and run them under three configurations: (a) pure cloud, (b) edge‑only, (c) active‑storage‑augmented continuum.
  4. Metrics collection – memory consumption (peak RAM on the compute node), storage I/O volume, wall‑clock training time, and final model accuracy are logged for each run.
  5. Policy evaluation – simple heuristics (e.g., “offload if input batch > 64 MiB”) are compared against a more sophisticated cost model that accounts for network latency and storage CPU load; a minimal sketch of both policies follows this list.
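
As a rough illustration of steps 1 and 5, the sketch below contrasts the size‑threshold heuristic with a simple cost model. The 64 MiB threshold is the example quoted above; the `Target` fields and the cost formula are assumptions invented for this sketch, not the paper's actual model.

```python
# Minimal placement-policy sketch. The 64 MiB threshold is the paper's example
# heuristic; the Target fields and the cost formula are illustrative assumptions.
from dataclasses import dataclass

OFFLOAD_THRESHOLD_BYTES = 64 * 2**20  # "offload if input batch > 64 MiB"


@dataclass
class Target:
    name: str          # "compute", "storage", or "cloud"
    latency_ms: float  # round-trip latency from where the data lives
    cpu_load: float    # current utilization of the node, in [0, 1]


def heuristic_policy(batch_bytes: int) -> str:
    """Size-threshold heuristic: offload large batches, keep small ones local."""
    return "storage" if batch_bytes > OFFLOAD_THRESHOLD_BYTES else "compute"


def cost_model_policy(batch_bytes: int, targets: list[Target]) -> Target:
    """Pick the cheapest target, weighing transfer cost against CPU load."""
    def cost(t: Target) -> float:
        # The data already resides on the storage node, so running there avoids
        # the transfer; other targets pay for moving the batch over the link.
        transfer = 0.0 if t.name == "storage" else batch_bytes * t.latency_ms
        return transfer + t.cpu_load * batch_bytes

    return min(targets, key=cost)
```

In the prototype, this decision point sits in the middleware wrapper from step 1, which intercepts data‑access calls before they reach the AI library.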

Results & Findings

| Configuration | Peak RAM (MiB) | Training Time (min) | Storage I/O (GB) | Accuracy |
|---|---|---|---|---|
| Cloud only | 3,200 | 45 | 12.8 | 92.1 % |
| Edge only | 1,800 | 68 | 9.5 | 91.8 % |
| Active‑Storage Continuum | 1,200 | 32 | 8.3 | 92.0 % |
  • Memory reduction: Offloading the data‑preprocessing and early‑layer convolutions to storage cuts the RAM needed on the compute node by ~60 %.
  • Training speed: Overall wall‑clock time improves by ~30 % because the storage node processes data in place, eliminating repeated network transfers.
  • Accuracy impact: Negligible (<0.3 % drop), confirming that moving computation does not degrade model quality.
  • Scalability: Adding storage nodes reduces training time roughly linearly up to four nodes; beyond that, network contention offsets the gains.

Practical Implications

  • For ML engineers: You can keep existing PyTorch/TensorFlow codebases and simply wrap data loaders with the provided Python SDK to reap active‑storage benefits; no model rewriting is required (see the sketch after this list).
  • Edge‑centric deployments: Devices with limited RAM (e.g., IoT gateways) can now run larger models by delegating heavy tensor ops to nearby NVMe‑based storage appliances that expose compute kernels.
  • Cost optimization: Reducing data movement translates to lower bandwidth bills and less pressure on cloud compute instances, making “pay‑as‑you‑go” AI pipelines more economical.
  • Rapid prototyping: Because the architecture is built on mainstream Python libraries, data scientists can experiment with new algorithms without worrying about the underlying hardware topology.
  • Vendor relevance: Storage vendors that embed GPUs/TPUs or FPGA accelerators can differentiate their products by offering an “AI‑ready” API layer, opening new revenue streams.
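
The snippet below sketches what wrapping a data loader (first bullet above) could look like from the ML engineer's side. `RemoteDataset` and its `fetch_item` call are hypothetical stand‑ins for the prototype's SDK, not its actual API.

```python
# Hypothetical client-side wrapper: the dataset stays in active storage and
# only decoded tensors travel back. RemoteDataset/fetch_item are stand-ins
# for the prototype's SDK, not its real API.
import torch
from torch.utils.data import DataLoader, Dataset


class RemoteDataset(Dataset):
    """Thin view over data held (and preprocessed) by a storage-side object."""

    def __init__(self, remote_obj, length: int):
        self.remote = remote_obj  # proxy to the active-storage service object
        self.length = length

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, idx: int):
        # Decoding and normalization happen remotely, next to the data.
        sample, label = self.remote.fetch_item(idx)
        return torch.as_tensor(sample), torch.as_tensor(label)


# The existing training script is otherwise unchanged:
# loader = DataLoader(RemoteDataset(shard_proxy, length=50_000), batch_size=64)
# for x, y in loader:
#     ...  # usual PyTorch forward/backward pass
```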

Limitations & Future Work

  • Hardware dependency: The gains hinge on storage nodes that expose sufficient compute resources (e.g., CPUs with SIMD, optional GPUs). Low‑end SATA drives won’t see the same benefit.
  • Scheduling simplicity: The current policy engine uses heuristics; a more sophisticated scheduler (reinforcement‑learning‑based or QoS‑aware) could better handle dynamic workloads.
  • Security & isolation: Executing user code inside storage raises concerns about sandboxing and multi‑tenant isolation, which the prototype does not fully address.
  • Broader workloads: Experiments focus on relatively small models; scaling to massive transformer‑style networks may expose new bottlenecks (e.g., memory bandwidth on storage CPUs).
  • Standardization: The authors suggest extending emerging standards (e.g., OpenCAPI, NVMe‑OF) to formalize compute‑in‑storage APIs, a direction they plan to explore.

Authors

  • Alex Barceló
  • Sebastián A. Cajas Ordoñez
  • Jaydeep Samanta
  • Andrés L. Suárez-Cetrulo
  • Romila Ghosh
  • Ricardo Simón Carbajo
  • Anna Queralt

Paper Information

  • arXiv ID: 2512.02646v1
  • Categories: cs.DC
  • Published: December 2, 2025