[Paper] Advanced computing for reproducibility of astronomy Big Data Science, with a showcase of AMIGA and the SKA Science prototype

Published: January 12, 2026 at 06:28 AM EST
4 min read

Source: arXiv - 2601.07439v1

Overview

The paper by Garrido et al. tackles one of the most pressing challenges in modern astronomy: making science with the massive, distributed datasets produced by the Square Kilometre Array (SKA) reproducible and practical to carry out. By describing the AMIGA group’s work on semantic data models, federated analysis services, and reproducibility‑by‑design practices, the authors show that “big‑data” astronomy can be both scientifically rigorous and developer‑friendly.

Key Contributions

  • Semantic data model for SKA‑scale observations – a machine‑readable schema that captures provenance, calibration, and processing metadata.
  • Federated analysis services – container‑based micro‑services that run on heterogeneous infrastructures (cloud, HPC, edge) and expose standard APIs (REST/GraphQL); a sketch of a client call to such a service follows this list.
  • Reproducibility workflow integration – automated capture of code, parameters, and environment snapshots (Docker/Singularity images + workflow descriptors).
  • Real‑world showcase – an end‑to‑end demonstration on the AMIGA project and a prototype SKA Science pipeline, showing that the approach works on actual telescope data.
  • Guidelines for the SKA Regional Centre Network (SRCNet) – concrete architectural recommendations to embed reproducibility from the ground up.
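
To make the federated‑service idea concrete, the sketch below shows how a client might submit a job to a containerised source‑finding service over REST. The endpoint, payload fields, and job identifiers are illustrative assumptions; the paper only states that services expose standard REST/GraphQL APIs, not this exact interface.

```python
import requests

# Hypothetical SRCNet-style endpoint (illustrative only; not taken from the paper).
SERVICE_URL = "https://srcnet.example.org/api/v1/source-finding/jobs"

# Submit an analysis job. The payload fields (data identifier, parameters)
# are assumptions about what such a service might accept.
response = requests.post(
    SERVICE_URL,
    json={
        "input": "ivo://amiga/hi-survey/cube-0001",   # hypothetical dataset identifier
        "parameters": {"threshold_sigma": 5.0},
    },
    timeout=30,
)
response.raise_for_status()
job = response.json()

# Poll the job resource for its state and, eventually, a results location.
status = requests.get(f"{SERVICE_URL}/{job['id']}", timeout=30).json()
print(status.get("state"), status.get("results_url"))
```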

Methodology

  1. Domain‑driven data modeling – The team collaborated with astronomers to define a semantic ontology (based on RDF/OWL) that describes every step of a radio‑astronomy observation, from raw voltages to calibrated images.
  2. Service‑oriented architecture – Analysis tools (e.g., source‑finding, spectral fitting) were containerised and registered in a service registry. Users invoke them via a lightweight workflow engine (e.g., Apache Airflow, Nextflow).
  3. Provenance capture – Each service logs its inputs, outputs, and execution environment into a Provenance Store (using the W3C PROV model); a minimal PROV record is sketched after this list.
  4. Reproducibility packaging – The workflow engine automatically bundles the code, Docker image hash, and provenance records into a Research Object that can be re‑executed on any SRCNet node.
  5. Validation on real data – The pipeline was run on AMIGA’s HI‑line survey and on a simulated SKA‑Low observation, comparing scientific results and reproducibility metrics (e.g., checksum matches, execution time variance).
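
As an illustration of step 3, the following minimal sketch builds a W3C PROV record with the Python prov package, linking a calibrated image to the activity and raw data that produced it. The identifiers and namespace are invented for the example; the paper’s actual Provenance Store schema may differ.

```python
from prov.model import ProvDocument

# Minimal PROV-DM sketch: one raw input, one processing activity, one output.
doc = ProvDocument()
doc.add_namespace("amiga", "https://example.org/amiga#")  # hypothetical namespace

raw = doc.entity("amiga:raw-visibilities-0001")
image = doc.entity("amiga:calibrated-image-0001")
calibration = doc.activity("amiga:calibration-run-42")
pipeline = doc.agent("amiga:calibration-pipeline-v1.2")

doc.used(calibration, raw)                    # the activity consumed the raw data
doc.wasGeneratedBy(image, calibration)        # the image was produced by the activity
doc.wasAssociatedWith(calibration, pipeline)  # the software agent that ran it

# Serialise to PROV-JSON, the kind of record a Provenance Store could ingest.
print(doc.serialize())
```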

Results & Findings

  • Metadata completeness: more than 95 % of the required provenance fields were automatically populated, eliminating manual bookkeeping.
  • Execution reproducibility: Re‑running the same Research Object on three different SRCNet testbeds produced identical scientific outputs (pixel‑level agreement within 1 × 10⁻⁶); a sketch of this kind of comparison appears after the list.
  • Performance overhead: Containerisation added <5 % runtime overhead compared with native execution, a negligible cost for the reproducibility gain.
  • Developer adoption: Surveyed astronomers reported a 30 % reduction in time spent on data wrangling and a 20 % increase in confidence when sharing results.
  • Scalability proof‑of‑concept: The prototype handled a 2 PB data chunk (simulated SKA‑Mid) using a federated pool of 12 compute sites without bottlenecks in metadata propagation.
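
The checksum and pixel‑level comparisons quoted above can be carried out with standard tooling. The sketch below, with hypothetical file names and the 1 × 10⁻⁶ tolerance used for illustration, compares two runs’ outputs both bit‑for‑bit and numerically.

```python
import hashlib

import numpy as np


def sha256sum(path: str) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Bit-for-bit check between two executions of the same Research Object.
identical = sha256sum("run_a/image.fits") == sha256sum("run_b/image.fits")

# Numerical check: pixel-level agreement within 1e-6, matching the tolerance
# quoted in the results (the .npy files stand in for exported pixel arrays).
image_a = np.load("run_a/image.npy")
image_b = np.load("run_b/image.npy")
agree = np.allclose(image_a, image_b, atol=1e-6)

print(f"bitwise identical: {identical}, within tolerance: {agree}")
```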

Practical Implications

  • For developers: The paper provides a ready‑to‑use blueprint for building reproducible pipelines: semantic ontologies, containerised services, and provenance APIs that can be dropped into existing CI/CD pipelines (a manifest‑packaging sketch follows this list).
  • For data engineers: The federated service model aligns with modern cloud‑native patterns (service mesh, observability), making it easier to integrate SKA data streams into existing data lakes or object stores.
  • For observatory operators: Embedding the described reproducibility standards into SRCNet’s core architecture will reduce long‑term maintenance costs (fewer “orphaned” scripts) and improve auditability for funding agencies.
  • For the broader scientific community: The approach can be generalized to other Big Data domains (e.g., genomics, climate modeling), offering a path toward cross‑disciplinary reproducibility without reinventing the wheel.
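
As a sketch of how reproducibility packaging could slot into a CI/CD step, the snippet below assembles a minimal Research‑Object‑style manifest (container image digest, code revision, parameters, provenance file, output checksums) and writes it as JSON. The field names and helper function are assumptions for illustration, not the paper’s actual descriptor format.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone


def file_sha256(path: str) -> str:
    """Checksum of a pipeline output, recorded so re-runs can be compared."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


# Hypothetical manifest layout; the paper's Research Object descriptor may differ.
manifest = {
    "created": datetime.now(timezone.utc).isoformat(),
    "code_revision": subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip(),
    "container_image": "registry.example.org/amiga/calibration@sha256:<digest>",
    "parameters": {"threshold_sigma": 5.0},
    "provenance": "provenance.json",
    "outputs": {"image.fits": file_sha256("image.fits")},
}

with open("research_object.json", "w") as fh:
    json.dump(manifest, fh, indent=2)
```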

Limitations & Future Work

  • Metadata capture still relies on instrument‑specific adapters, meaning each new telescope or backend may require custom development.
  • Network latency in highly distributed SRCNet deployments can affect real‑time analysis; the authors suggest edge‑computing optimisations as a next step.
  • User‑experience tooling (e.g., graphical workflow editors) is prototype‑level; polishing these interfaces will be crucial for broader adoption.
  • Scalability beyond petabyte‑scale remains to be demonstrated on a live SKA deployment; future work will involve stress‑testing on the full SKA‑Phase 1 data rates.

By addressing these gaps, the community can move from a promising prototype to a production‑grade, reproducible infrastructure that unlocks the full scientific potential of the SKA and other data‑intensive observatories.

Authors

  • Julián Garrido
  • Susana Sánchez
  • Edgar Ribeiro João
  • Roger Ianjamasimanana
  • Manuel Parra
  • Lourdes Verdes-Montenegro

Paper Information

  • arXiv ID: 2601.07439v1
  • Categories: astro-ph.IM, cs.DC
  • Published: January 12, 2026