[Paper] ASTRA-sim 3.0: Next-Level Distributed Machine Learning Simulations via High-Fidelity GPU and Infrastructure Modeling

Published: June 9, 2026 at 01:36 AM EDT
Source: arXiv - 2606.10440v1

Overview

Distributed machine learning (ML) is a key paradigm for today’s large-scale artificial intelligence applications. As model inference emerges as an important use case, faithful modeling of latency-sensitive collective communication has become more important than ever. This makes it necessary to capture the device architecture and to model control and data paths at high fidelity. A common, detailed representation of distributed ML infrastructure is also crucial.

We revisit ASTRA-sim, a promising open-source, community-driven simulator. In this work, we identify limitations of the current ASTRA-sim simulator and augment it with new features that enable fine-grained, high-fidelity simulation with a standardized infrastructure representation, opening new opportunities for design space exploration. We propose simulating at cache-line-sized load-store granularity, with a detailed graphics processing unit (GPU) execution model, to balance simulation scalability and fidelity. We also introduce InfraGraph, a standardized representation that captures distributed ML network infrastructure in detail. Using the updated ASTRA-sim 3.0 simulator, we showcase design space explorations for optimized collective algorithms, network requirements, and GPU architectures.
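To give a feel for what "cache-line-sized load-store granularity" can mean in practice, here is a minimal Python sketch. It is not ASTRA-sim code and does not use any ASTRA-sim API. The names `StoreOp`, `split_into_stores`, and `ring_all_reduce_store_count`, and the 128-byte line size, are hypothetical choices made for illustration. The sketch splits a message into cache-line-sized stores and counts how many stores each rank would issue in a textbook ring all-reduce (reduce-scatter followed by all-gather).

```python
from dataclasses import dataclass
from math import ceil
from typing import Iterator

# Assumption: illustrative cache-line size, not a value taken from the paper.
CACHE_LINE_BYTES = 128


@dataclass(frozen=True)
class StoreOp:
    """One hypothetical cache-line-sized store issued toward a peer GPU."""
    dst_rank: int
    offset: int  # byte offset within the message
    size: int    # bytes carried by this store (at most one cache line)


def split_into_stores(message_bytes: int, dst_rank: int,
                      line_bytes: int = CACHE_LINE_BYTES) -> Iterator[StoreOp]:
    """Yield cache-line-sized stores that together cover the whole message."""
    if message_bytes < 0 or line_bytes <= 0:
        raise ValueError("message_bytes must be >= 0 and line_bytes must be > 0")
    for offset in range(0, message_bytes, line_bytes):
        # The final store may be shorter than a full cache line.
        yield StoreOp(dst_rank, offset, min(line_bytes, message_bytes - offset))


def ring_all_reduce_store_count(buffer_bytes: int, num_ranks: int,
                                line_bytes: int = CACHE_LINE_BYTES) -> int:
    """Stores issued per rank in a textbook ring all-reduce.

    The ring algorithm runs 2 * (p - 1) steps (reduce-scatter, then
    all-gather), and each step sends one chunk of roughly buffer / p bytes.
    """
    if num_ranks < 2:
        return 0  # a single rank has nobody to communicate with
    chunk_bytes = ceil(buffer_bytes / num_ranks)
    stores_per_step = ceil(chunk_bytes / line_bytes)
    return 2 * (num_ranks - 1) * stores_per_step


if __name__ == "__main__":
    for op in split_into_stores(300, dst_rank=1):
        print(op)
    print(ring_all_reduce_store_count(1 << 20, num_ranks=8))
```

The count grows with both buffer size and rank count, which hints at why a simulator working at this granularity has to trade fidelity against scalability.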

Key Contributions

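According to the abstract, the paper makes the following contributions:

  • Fine-grained simulation at cache-line-sized load-store granularity, with a detailed GPU execution model
  • InfraGraph, a standardized representation of distributed ML network infrastructure
  • Design space explorations for optimized collective algorithms, network requirements, and GPU architectures

The paper is listed under the following arXiv categories:

  • cs.DC (Distributed, Parallel, and Cluster Computing)
  • cs.LG (Machine Learning)
  • cs.NI (Networking and Internet Architecture)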

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

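This research contributes to distributed, parallel, and cluster computing (cs.DC). Per the abstract, the updated simulator supports design space exploration of collective algorithms, network requirements, and GPU architectures for distributed ML.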

Authors

  • William Won
  • Jinsun Yoo
  • Tuan Ta
  • Moumita Dey
  • Andy Balogh
  • Pradosh Datta
  • Furkan Eris
  • Conor Green
  • Winston Liu
  • Changhai Man
  • Kingshuk Mandal
  • Amos Rai
  • Vinay Ramakrishnaiah
  • Ruchi Shah
  • David Sidler
  • Harsh Sikhwal
  • Hanjiang Wu
  • Tushar Krishna
  • Bradford M. Beckmann

Paper Information

  • arXiv ID: 2606.10440v1
  • Categories: cs.DC, cs.LG, cs.NI
  • Published: June 9, 2026