[Paper] Offloading Artificial Intelligence Workloads across the Computing Continuum by means of Active Storage Systems

Published: December 2, 2025 at 06:04 AM EST
4 min read
Source: arXiv - 2512.02646v1

Overview

The paper investigates how active storage systems—storage devices that can run code—can be used to spread AI training and inference tasks across the whole computing continuum (edge, fog, and cloud). By moving parts of the workload directly to where the data lives, the authors show measurable gains in memory usage, training speed, and overall resource efficiency, while keeping the barrier to entry low for data scientists.

Key Contributions

  • Continuum‑aware software architecture that orchestrates AI workload placement across heterogeneous devices (edge, fog, cloud).
  • Integration of active storage (dataClay) with popular Python AI libraries (e.g., PyTorch, TensorFlow) to enable “compute‑in‑storage” without rewriting models (a minimal sketch of this idea follows the list).
  • Comprehensive evaluation of memory footprint, storage overhead, training time, and accuracy on a set of representative AI tasks (image classification, time‑series forecasting).
  • Open‑source prototype that demonstrates a practical, low‑effort path for developers to offload parts of their pipelines to storage nodes.
  • Trade‑off analysis that quantifies when active‑storage offloading is beneficial versus when traditional cloud execution remains preferable.
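
A minimal sketch of the compute‑in‑storage idea referenced above: a dataset shard exposed as an active‑storage object whose methods execute next to the data. The `DataClayObject` and `activemethod` names follow the dataClay 3.x Python API, but the class, its fields, and the method bodies are illustrative assumptions rather than code from the paper.

```python
# Illustrative active-storage object; assumes the dataClay 3.x Python API
# (DataClayObject + @activemethod). Fields and method bodies are hypothetical.
import torch
from dataclay import DataClayObject, activemethod


class ShardedDataset(DataClayObject):
    """A dataset shard that lives (and computes) inside the storage node."""

    samples: torch.Tensor
    labels: torch.Tensor

    @activemethod
    def normalize(self, mean: float, std: float) -> None:
        # Executes in the storage backend, next to the data.
        self.samples = (self.samples - mean) / std

    @activemethod
    def get_batch(self, start: int, size: int):
        # Only the requested mini-batch crosses the network to the trainer.
        return self.samples[start:start + size], self.labels[start:start + size]
```

A training script on a compute node would then call `get_batch` through the object's proxy instead of loading the whole shard into local RAM.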

Methodology

  1. Design of a middleware layer – a thin Python wrapper that intercepts data‑access calls and decides, based on policy (e.g., data size, device capability), whether to execute a computation locally, on a nearby storage node, or in the cloud.
  2. Active storage platform (dataClay) – the authors extend dataClay with custom “service objects” that expose AI primitives (tensor ops, mini‑batch training loops) as remote callable methods.
  3. Benchmark suite – they pick three common AI workloads (ResNet‑18 on CIFAR‑10, LSTM on a synthetic sensor stream, and a small GNN) and run them under three configurations: (a) pure cloud, (b) edge‑only, (c) active‑storage‑augmented continuum.
  4. Metrics collection – memory consumption (peak RAM on the compute node), storage I/O volume, wall‑clock training time, and final model accuracy are logged for each run.
  5. Policy evaluation – simple heuristics (e.g., “offload if input batch > 64 MiB”) are compared against a more sophisticated cost model that accounts for network latency and storage CPU load; a minimal sketch of both policies follows this list.
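
As a rough illustration of steps 1 and 5, the sketch below contrasts the size‑threshold heuristic with a simple cost model. The 64 MiB threshold is the example quoted above; the `Target` fields and the cost formula are assumptions invented for this sketch, not the paper's actual model.

```python
# Minimal placement-policy sketch. The 64 MiB threshold is the paper's example
# heuristic; the Target fields and the cost formula are illustrative assumptions.
from dataclasses import dataclass

OFFLOAD_THRESHOLD_BYTES = 64 * 2**20  # "offload if input batch > 64 MiB"


@dataclass
class Target:
    name: str          # "compute", "storage", or "cloud"
    latency_ms: float  # round-trip latency from where the data lives
    cpu_load: float    # current utilization of the node, in [0, 1]


def heuristic_policy(batch_bytes: int) -> str:
    """Size-threshold heuristic: offload large batches, keep small ones local."""
    return "storage" if batch_bytes > OFFLOAD_THRESHOLD_BYTES else "compute"


def cost_model_policy(batch_bytes: int, targets: list[Target]) -> Target:
    """Pick the cheapest target, weighing transfer cost against CPU load."""
    def cost(t: Target) -> float:
        # The data already resides on the storage node, so running there avoids
        # the transfer; other targets pay for moving the batch over the link.
        transfer = 0.0 if t.name == "storage" else batch_bytes * t.latency_ms
        return transfer + t.cpu_load * batch_bytes

    return min(targets, key=cost)
```

In the prototype, this decision point sits in the middleware wrapper from step 1, which intercepts data‑access calls before they reach the AI library.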

Results & Findings

| Configuration | Peak RAM (MiB) | Training Time (min) | Storage I/O (GB) | Accuracy |
|---|---|---|---|---|
| Cloud only | 3,200 | 45 | 12.8 | 92.1 % |
| Edge only | 1,800 | 68 | 9.5 | 91.8 % |
| Active‑Storage Continuum | 1,200 | 32 | 8.3 | 92.0 % |
  • Memory reduction: Offloading the data‑preprocessing and early‑layer convolutions to storage cuts the RAM needed on the compute node by ~60 %.
  • Training speed: Overall wall‑clock time improves by ~30 % because the storage node processes data in place, eliminating repeated network transfers.
  • Accuracy impact: Negligible (<0.3 % drop), confirming that moving computation does not degrade model quality.
  • Scalability: Adding storage nodes reduces training time roughly linearly up to four nodes; beyond that, network contention offsets the gains.

Practical Implications

  • For ML engineers: You can keep existing PyTorch/TensorFlow codebases and simply wrap data loaders with the provided Python SDK to reap active‑storage benefits; no model rewriting is required (see the sketch after this list).
  • Edge‑centric deployments: Devices with limited RAM (e.g., IoT gateways) can now run larger models by delegating heavy tensor ops to nearby NVMe‑based storage appliances that expose compute kernels.
  • Cost optimization: Reducing data movement translates to lower bandwidth bills and less pressure on cloud compute instances, making “pay‑as‑you‑go” AI pipelines more economical.
  • Rapid prototyping: Because the architecture is built on mainstream Python libraries, data scientists can experiment with new algorithms without worrying about the underlying hardware topology.
  • Vendor relevance: Storage vendors that embed GPUs/TPUs or FPGA accelerators can differentiate their products by offering an “AI‑ready” API layer, opening new revenue streams.
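
The snippet below sketches what wrapping a data loader (first bullet above) could look like from the ML engineer's side. `RemoteDataset` and its `fetch_item` call are hypothetical stand‑ins for the prototype's SDK, not its actual API.

```python
# Hypothetical client-side wrapper: the dataset stays in active storage and
# only decoded tensors travel back. RemoteDataset/fetch_item are stand-ins
# for the prototype's SDK, not its real API.
import torch
from torch.utils.data import DataLoader, Dataset


class RemoteDataset(Dataset):
    """Thin view over data held (and preprocessed) by a storage-side object."""

    def __init__(self, remote_obj, length: int):
        self.remote = remote_obj  # proxy to the active-storage service object
        self.length = length

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, idx: int):
        # Decoding and normalization happen remotely, next to the data.
        sample, label = self.remote.fetch_item(idx)
        return torch.as_tensor(sample), torch.as_tensor(label)


# The existing training script is otherwise unchanged:
# loader = DataLoader(RemoteDataset(shard_proxy, length=50_000), batch_size=64)
# for x, y in loader:
#     ...  # usual PyTorch forward/backward pass
```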

Limitations & Future Work

  • Hardware dependency: The gains hinge on storage nodes that expose sufficient compute resources (e.g., CPUs with SIMD, optional GPUs). Low‑end SATA drives won’t see the same benefit.
  • Scheduling simplicity: The current policy engine uses heuristics; a more sophisticated scheduler (reinforcement‑learning‑based or QoS‑aware) could better handle dynamic workloads.
  • Security & isolation: Executing user code inside storage raises concerns about sandboxing and multi‑tenant isolation, which the prototype does not fully address.
  • Broader workloads: Experiments focus on relatively small models; scaling to massive transformer‑style networks may expose new bottlenecks (e.g., memory bandwidth on storage CPUs).
  • Standardization: The authors suggest extending emerging standards (e.g., OpenCAPI, NVMe‑OF) to formalize compute‑in‑storage APIs, a direction they plan to explore.

Authors

  • Alex Barceló
  • Sebastián A. Cajas Ordoñez
  • Jaydeep Samanta
  • Andrés L. Suárez-Cetrulo
  • Romila Ghosh
  • Ricardo Simón Carbajo
  • Anna Queralt

Paper Information

  • arXiv ID: 2512.02646v1
  • Categories: cs.DC
  • Published: December 2, 2025