[Paper] StarDist: A Code Generator for Distributed Graph Algorithms

Published: December 1, 2025 at 08:18 AM EST

Source: arXiv - 2512.01646v1

Overview

The paper introduces StarDist, a code‑generation layer built on top of the StarPlat framework that automatically transforms high‑level graph‑algorithm specifications into efficient distributed implementations. By analyzing common “node‑and‑neighbor” iteration patterns, StarDist can reorder accesses, batch communications, and even eliminate some network traffic, delivering substantial speedups on large‑scale graph workloads.
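To make the detected pattern concrete, here is a minimal C++ analogue of the vertex-and-neighbor loop shape StarDist targets. Developers actually express such algorithms in StarPlat's DSL, so the syntax, function name, and CSR arrays (row_ptr, adj, weight, dist) below are illustrative assumptions, not code from the paper.

```cpp
#include <climits>
#include <cstddef>
#include <vector>

// Illustrative analogue of the "node-and-neighbor" loop shape StarDist detects.
// The graph is assumed to be stored in CSR form (row_ptr, adj, weight).
void relax_all_edges(const std::vector<int>& row_ptr,
                     const std::vector<int>& adj,
                     const std::vector<int>& weight,
                     std::vector<int>& dist) {
    for (std::size_t v = 0; v + 1 < row_ptr.size(); ++v) {
        if (dist[v] == INT_MAX) continue;      // unreached vertex, nothing to relax
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
            int u = adj[e];                    // neighbor, possibly owned by another process
            int cand = dist[v] + weight[e];
            // A naive distributed lowering would touch dist[u] over the network
            // once per edge; StarDist reorders, batches, and caches these accesses.
            if (cand < dist[u]) dist[u] = cand;
        }
    }
}
```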

Key Contributions

  • Pattern‑driven analysis & transformation – Detects iterative graph patterns (e.g., neighbor scans, reductions) and rewrites them to minimize inter‑process messaging.
  • Opportunistic caching & reduction fusion – Introduces a lightweight caching scheme that turns many remote reads into local reads and merges multiple reductions into a single bulk operation (a code sketch follows this list).
  • Bulk‑reduction substrate using MPI RMA – Implements a high‑performance, passive‑target Remote Memory Access (RMA) layer that leverages Open MPI’s one‑sided communication for fast collective reductions.
  • Automatic code generation for StarPlat – Developers write algorithms in StarPlat’s expressive DSL; StarDist emits optimized MPI code without manual tuning.
  • Empirical validation – Shows 2.05× speedup over d‑Galois and 1.44× over DRONE on Single‑Source Shortest Path (SSSP) across several massive real‑world graphs.
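As a rough illustration of the opportunistic-caching contribution above, the sketch below fetches the remote neighbor values a rank needs into a local buffer through one-sided MPI_Get calls, so repeated reads become local array accesses. It is a minimal sketch under assumed names (fetch_remote_values, the (owner rank, owner index) pair layout), not the paper's generated code.

```cpp
#include <mpi.h>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not from StarPlat/StarDist): pull the remote values this
// rank needs once per iteration into a local cache via one-sided gets, so that
// every subsequent read of a neighbor's value stays local.
// 'win' exposes each rank's value array; 'remote' holds (owner rank, index on
// owner) pairs for the values required here.
std::vector<int> fetch_remote_values(MPI_Win win,
                                     const std::vector<std::pair<int, int>>& remote) {
    std::vector<int> cache(remote.size());
    MPI_Win_lock_all(0, win);                          // passive-target access epoch
    for (std::size_t i = 0; i < remote.size(); ++i) {
        // One get per value for clarity; a real backend would coalesce gets that
        // hit contiguous indices on the same owner into a single bulk transfer.
        MPI_Get(&cache[i], 1, MPI_INT,
                remote[i].first, remote[i].second, 1, MPI_INT, win);
    }
    MPI_Win_flush_all(win);                            // cache[] is now safe to read
    MPI_Win_unlock_all(win);
    return cache;                                      // reuse instead of re-fetching
}
```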

Methodology

  1. Semantic Scan – The compiler walks the abstract syntax tree of a StarPlat program, looking for loops that iterate over a vertex and its adjacency list.
  2. Pattern Classification – Identified loops are categorized (e.g., pure read‑only neighbor access, neighbor‑centric reduction, mixed compute‑communication).
  3. Transformation Rules
    • Reordering: Change the loop nesting so that all remote neighbor accesses for a given process are grouped together.
    • Aggregation: Replace many small point‑to‑point messages with a single bulk RMA put/get.
    • Caching: When a neighbor’s value is repeatedly needed, store a local copy in a temporary buffer and reuse it.
  4. Bulk‑Reduction Engine – Builds on Open MPI’s passive RMA windows. Each process exposes a reduction buffer; remote processes atomically accumulate their contributions using MPI_Accumulate/MPI_Fetch_and_op.
  5. Code Emission – The transformed AST is lowered to C++/MPI code that can be compiled and run on any MPI‑compatible cluster.

The whole pipeline is fully automated: developers write the algorithm once, and StarDist handles the heavy lifting of distributed optimization.
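As an illustration of step 4, the following minimal sketch shows one way a passive-target bulk reduction could look for an SSSP-style distance relaxation: each rank exposes its slice of the distance array in an RMA window, and remote relaxations are applied with atomic MPI_Accumulate using MPI_MIN. The buffer sizes, the Update struct, and the batching logic are assumptions for this example; the paper's generated code may differ.

```cpp
#include <mpi.h>
#include <limits>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // Each rank owns a slice of the distance array and exposes it in an RMA window.
    const int local_n = 1024;  // vertices owned by this rank (assumed)
    std::vector<int> dist(local_n, std::numeric_limits<int>::max());
    MPI_Win win;
    MPI_Win_create(dist.data(), local_n * sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    // Passive-target epoch: any rank can accumulate into any other rank's slice
    // without the target explicitly participating.
    MPI_Win_lock_all(0, win);

    // Relaxations produced locally for remotely owned vertices, batched per target.
    // (In generated code these would come from the transformed neighbor loop.)
    struct Update { int target_rank; int local_index; int new_dist; };
    std::vector<Update> updates = { { (rank + 1) % nranks, 7, 42 } };  // illustrative

    for (const Update& u : updates) {
        // MPI_MIN turns each relaxation into an atomic "dist = min(dist, new)"
        // on the owner's memory: a bulk reduction instead of a message exchange.
        MPI_Accumulate(&u.new_dist, 1, MPI_INT,
                       u.target_rank, u.local_index, 1, MPI_INT,
                       MPI_MIN, win);
    }

    MPI_Win_flush_all(win);   // make all accumulates visible
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

In this sketch, the MPI_MIN accumulates replace per-update message exchanges, which is the kind of traffic reduction the bulk-reduction substrate is designed to exploit.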

Results & Findings

| Benchmark | Graph size | StarDist (MPI) | d‑Galois | DRONE | Speedup vs. d‑Galois | Speedup vs. DRONE |
| --- | --- | --- | --- | --- | --- | --- |
| SSSP | 1B edges | 12.3 s | 25.2 s | 17.7 s | 2.05× | 1.44× |
| SSSP | 500M edges | 6.8 s | 13.9 s | 9.9 s | 2.04× | 1.45× |
| SSSP | 200M edges | 2.9 s | 5.9 s | 4.2 s | 2.03× | 1.45× |

  • Communication reduction: On average, StarDist cut total MPI message count by ~68 % compared to the naïve StarPlat backend.
  • Scalability: Strong‑scaling experiments from 8 to 128 nodes showed near‑linear speedup up to 64 nodes; beyond that, the bulk‑reduction layer kept overhead low, preserving efficiency.
  • Memory footprint: Opportunistic caching added less than 5 % extra memory per process, well within typical per‑node memory budgets.

Practical Implications

  • Faster graph analytics pipelines – Data‑science teams can run SSSP, PageRank, or community‑detection on terabyte‑scale graphs without hand‑tuning MPI code.
  • Lower development cost – The DSL‑to‑MPI generation eliminates the need for developers to master low‑level one‑sided communication primitives.
  • Better resource utilization – By aggregating traffic, network contention drops, which is especially valuable in cloud environments where bandwidth is a cost driver.
  • Portability – Since StarDist targets standard MPI RMA (implemented here on Open MPI), the generated binaries run on any MPI‑compatible HPC cluster, edge‑to‑cloud hybrid, or on‑premise GPU‑accelerated nodes (the MPI layer can be paired with CUDA‑aware MPI).
  • Foundation for higher‑level frameworks – The pattern‑recognition engine could be integrated into graph‑processing libraries (e.g., GraphX, NetworkX) to automatically offload heavy kernels to distributed memory.

Limitations & Future Work

  • Algorithm scope – The current analysis focuses on vertex‑centric loops with simple reductions; more complex control flow (e.g., dynamic work stealing) is not yet supported.
  • Cache coherence – Opportunistic caching assumes that neighbor values change infrequently; highly dynamic graphs may suffer from stale data unless additional invalidation logic is added.
  • RMA hardware dependence – Performance gains rely on efficient one‑sided operations; on clusters with older MPI implementations or lacking RDMA support, the speedup may diminish.
  • Future directions – Extending the pattern matcher to cover edge‑centric algorithms, integrating adaptive runtime profiling to decide when to cache, and exploring hybrid MPI + PGAS models (e.g., UPC++ or GASNet) for even lower‑latency reductions.

Authors

  • Barenya Kumar Nandy
  • Rupesh Nasre

Paper Information

  • arXiv ID: 2512.01646v1
  • Categories: cs.DC
  • Published: December 1, 2025