[Paper] Bringing computation to the data: A MOEA-driven approach for optimising data processing in the context of the SKA and SRCNet
Source: arXiv - 2601.01980v1
Overview
The paper tackles one of the central data‑processing hurdles facing the Square Kilometre Array (SKA): moving petabytes of raw telescope data across a global network of regional centres is becoming impractical. The authors propose a computation‑to‑data strategy that combines Function‑as‑a‑Service (FaaS) with a Multi‑Objective Evolutionary Algorithm (MOEA) to automatically decide where and how to run data‑intensive tasks, balancing speed, energy use, and data‑transfer costs.
Key Contributions
- Hybrid FaaS + MOEA framework that dynamically generates near‑optimal execution plans for SKA data pipelines.
- Multi‑objective formulation that simultaneously minimizes execution time and energy consumption while respecting data‑location constraints.
- Prototype implementation integrated into the SKA Regional Centres Network (SRCNet) architecture, demonstrating in‑situ function deployment close to data sources.
- Performance evaluation showing up to a 30 % reduction in end‑to‑end processing time and a 20 % lower energy footprint compared with a centralised processing baseline.
- Open‑source reference code and a reproducible experimental workflow for the broader scientific‑computing community.
Methodology
- Problem Modeling – The data‑processing workflow is expressed as a directed acyclic graph (DAG) where nodes are lightweight functions (e.g., calibration, imaging) and edges represent data dependencies.
- FaaS Layer – Each function is packaged as a container‑based FaaS unit that can be instantiated on any SRCNet node (edge, regional centre, or cloud). The FaaS runtime abstracts storage, networking, and scaling details from the optimizer.
- Decision Engine – A Multi‑Objective Evolutionary Algorithm (specifically NSGA‑II) explores the large combinatorial space of possible function placements and scheduling orders; a minimal placement sketch is given after this list.
  - Objectives: (i) total wall‑clock time, (ii) total energy consumption.
  - Constraints: data locality (functions must run where their required input resides), network bandwidth caps, and node‑specific resource limits.
- Fitness Evaluation – For each candidate solution, a fast simulation model estimates execution time and energy based on historical profiling data of each function on each node type.
- Selection & Deployment – The Pareto‑optimal solutions are presented to a lightweight orchestrator that picks the plan best matching the current service‑level agreement (e.g., prioritize latency during observation bursts). The chosen plan is then materialized by spawning the corresponding FaaS instances across the network.
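To make the decision engine concrete, the sketch below casts function placement as a two‑objective NSGA‑II problem using the pymoo library. It is a minimal illustration under stated assumptions, not the authors' implementation: the function and node names, the profiled runtime and power numbers, the flat WAN‑transfer penalty, and the additive wall‑clock model (which ignores parallelism across DAG branches) are all made up for the example.

```python
# Minimal placement sketch: two-objective NSGA-II over "which node runs which
# function". All numbers below are illustrative assumptions, not paper data.
import numpy as np
from pymoo.core.problem import ElementwiseProblem
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.optimize import minimize

FUNCTIONS = ["flagging", "calibration", "imaging", "source_finding"]
NODES = ["edge-0", "regional-1", "cloud-2"]

# Hypothetical profiled runtime (s) and power draw (W) per function per node.
RUNTIME = np.array([[120,  90,  60],
                    [300, 210, 150],
                    [600, 400, 260],
                    [ 80,  70,  50]], dtype=float)
POWER = np.array([[ 60, 120, 200]] * 4, dtype=float)
INPUT_SITE = np.array([0, 0, 1, 1])   # node currently holding each input
TRANSFER_PENALTY_S = 500.0            # crude penalty for non-local execution


class PlacementProblem(ElementwiseProblem):
    """Decision vector: one (rounded) node index per pipeline function."""

    def __init__(self):
        super().__init__(n_var=len(FUNCTIONS), n_obj=2,
                         xl=0, xu=len(NODES) - 1)

    def _evaluate(self, x, out, *args, **kwargs):
        placement = np.round(x).astype(int)          # continuous vars -> node ids
        idx = np.arange(len(FUNCTIONS))
        runtimes = RUNTIME[idx, placement]
        # Penalise running a function away from the node that holds its data.
        runtimes = runtimes + TRANSFER_PENALTY_S * (placement != INPUT_SITE)
        energy = POWER[idx, placement] * runtimes    # joules per function
        # Simplification: total time is the sum of step times (no parallelism).
        out["F"] = [runtimes.sum(), energy.sum()]


res = minimize(PlacementProblem(), NSGA2(pop_size=40), ("n_gen", 100),
               seed=1, verbose=False)

X, F = np.atleast_2d(res.X), np.atleast_2d(res.F)
for x, f in zip(X, F):
    plan = [NODES[i] for i in np.round(x).astype(int)]
    print(plan, f"time={f[0]:.0f}s energy={f[1]:.0f}J")
```

The resulting Pareto front plays the role described in the Selection & Deployment step: an orchestrator would pick one plan from it according to the current service‑level agreement.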
Results & Findings
| Metric | Centralised (baseline) | MOEA‑driven FaaS (best Pareto) |
|---|---|---|
| End‑to‑end processing time | 1.00 × (reference) | 0.70 × (≈30 % faster) |
| Energy consumption | 1.00 × (reference) | 0.80 × (≈20 % lower) |
| Data transferred over WAN | 100 TB | 45 TB (≈55 % reduction) |
| Scheduler overhead | – | < 2 % of total runtime |
Key Takeaways
- Moving computation to the data cuts WAN traffic dramatically, which in turn reduces both latency and the energy spent on data movement (a back‑of‑envelope transfer‑time illustration follows this list).
- The MOEA quickly converges (within a few hundred generations) to solutions that respect all constraints, making it viable for near‑real‑time re‑planning during observation campaigns.
- The modular FaaS approach allows new processing steps to be added without re‑engineering the whole pipeline.
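To put the WAN‑traffic takeaway in perspective, the back‑of‑envelope below converts the 100 TB and 45 TB volumes from the results table into transfer time; the 10 Gbit/s sustained link speed is an assumption chosen for illustration, not a figure from the paper.

```python
# Illustrative only: link speed is an assumed 10 Gbit/s sustained throughput;
# the 100 TB and 45 TB volumes come from the results table above.
BITS_PER_TB = 8e12            # bits in one terabyte (10^12 bytes * 8)
LINK_BPS = 10e9               # assumed sustained WAN throughput

for tb in (100, 45):
    hours = tb * BITS_PER_TB / LINK_BPS / 3600
    print(f"{tb} TB -> {hours:.1f} h on a 10 Gbit/s link")
# 100 TB -> ~22.2 h, 45 TB -> ~10.0 h: the reduction alone saves roughly half
# a day of pure transfer time per processing run.
```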
Practical Implications
- For SKA developers: The framework offers a plug‑and‑play way to offload heavy calibration or imaging steps to the nearest edge node, freeing up central resources for other science cases.
- For cloud/edge providers: Demonstrates a concrete use‑case for FaaS beyond typical web workloads, encouraging investment in low‑latency, high‑throughput edge compute platforms.
- Energy‑aware scheduling: Operators can enforce greener operation policies (e.g., shift workloads to nodes powered by renewable energy) simply by adjusting the MOEA’s objective weights.
- Scalable workflow orchestration: The approach can be generalized to other exascale science projects (e.g., climate modelling, genomics) that face similar data‑movement bottlenecks.
- Developer tooling: The open‑source prototype includes a Python SDK for defining DAGs and custom cost models, lowering the barrier for integrating existing SKA pipelines.
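As an illustration of what such a DAG definition might look like, the self‑contained snippet below declares a toy pipeline with per‑node runtime annotations; the Task and Pipeline classes and the topological‑ordering helper are hypothetical stand‑ins, not the actual SDK's API.

```python
# Hypothetical, self-contained stand-in for an SDK-style DAG definition.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    """One pipeline step with made-up per-node runtime annotations."""
    name: str
    inputs: List[str] = field(default_factory=list)            # upstream tasks
    runtime_s: Dict[str, float] = field(default_factory=dict)  # node -> seconds


@dataclass
class Pipeline:
    """A tiny DAG container with dependency-respecting ordering."""
    tasks: Dict[str, Task] = field(default_factory=dict)

    def add(self, task: Task) -> Task:
        self.tasks[task.name] = task
        return task

    def topological_order(self) -> List[Task]:
        ordered, seen = [], set()

        def visit(name: str) -> None:
            if name in seen:
                return
            seen.add(name)
            for dep in self.tasks[name].inputs:
                visit(dep)               # dependencies come before the task
            ordered.append(self.tasks[name])

        for name in self.tasks:
            visit(name)
        return ordered


pipe = Pipeline()
pipe.add(Task("flagging", runtime_s={"edge-0": 120, "regional-1": 90}))
pipe.add(Task("calibration", inputs=["flagging"],
              runtime_s={"edge-0": 300, "regional-1": 210}))
pipe.add(Task("imaging", inputs=["calibration"],
              runtime_s={"edge-0": 600, "regional-1": 400}))
pipe.add(Task("source_finding", inputs=["imaging"],
              runtime_s={"edge-0": 80, "regional-1": 70}))

print([t.name for t in pipe.topological_order()])
# ['flagging', 'calibration', 'imaging', 'source_finding']
```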
Limitations & Future Work
- Simulation fidelity: The current fitness evaluator relies on profiled averages; real‑world variability (e.g., network jitter, node contention) may affect optimality.
- Scalability of the MOEA: While effective for the tested DAG sizes (≈50 functions), larger pipelines may require hierarchical or surrogate‑based optimization to keep runtime low.
- Security & data governance: Deploying functions across heterogeneous sites raises access‑control challenges that are not fully addressed.
- Future directions: The authors plan to (1) integrate online learning to refine cost models on‑the‑fly, (2) explore hybrid meta‑heuristics (e.g., MOEA + reinforcement learning) for faster convergence, and (3) conduct a full‑scale pilot on the operational SRCNet testbed.
Authors
- Manuel Parra‑Royón
- Álvaro Rodríguez‑Gallardo
- Susana Sánchez‑Expósito
- Laura Darriba‑Pol
- Jesús Sánchez‑Castañeda
- M. Ángeles Mendoza
- Julián Garrido
- Javier Moldón
- Lourdes Verdes‑Montenegro
Paper Information
- arXiv ID: 2601.01980v1
- Categories: cs.DC
- Published: January 5, 2026