[Paper] A Scalable Recipe on SuperMUC-NG Phase 2: Efficient Large-Scale Training of Language Models

Published: May 8, 2026 at 09:32 AM EDT

Source: arXiv - 2605.07726v1

Overview

The paper presents a “recipe” for training very large language models (LLMs) of up to 175 billion parameters on the SuperMUC‑NG Phase 2 (SMNG‑P2) supercomputer, which is equipped with Intel Data Center GPU Max 1550 accelerators. By combining several parallel‑training techniques in a reproducible, off‑the‑shelf software stack, the authors demonstrate that high‑throughput, efficient LLM training is achievable on modern GPU‑accelerated HPC systems without custom code modifications.

Key Contributions

  • Comprehensive parallelism recipe: Integration of tensor parallelism, pipeline parallelism, and sharded data parallelism tailored for SMNG‑P2.
  • Empirical performance model: Systematic hyper‑parameter tuning and measurement of the interaction between the three parallelism dimensions.
  • High efficiency on real hardware: Achieved 10 % of the theoretical peak BF16 throughput per tile for a 175 B model, 93 % weak‑scaling efficiency, and 82 % strong‑scaling efficiency on 128 nodes.
  • Reproducible workflow: Used only publicly available software (e.g., PyTorch, DeepSpeed, Megatron‑LM) in its default distributions; no custom patches were required.
  • Scalable blueprint: Provides a step‑by‑step guide that can be directly applied by other researchers or engineers on the same HPC platform.

Methodology

  1. Hardware baseline – The experiments run on SMNG‑P2, a cluster of Intel Data Center GPU Max 1550 accelerators (each GPU = 2 tiles, with 64 GB HBM per tile).
  2. Software stack – Standard PyTorch 2.x, DeepSpeed 0.12, and Megatron‑LM are used. The stack already supports the three parallelism primitives.
  3. Parallelism mix
    • Tensor parallelism splits each matrix multiplication across multiple tiles.
    • Pipeline parallelism partitions the model’s layers into stages, allowing different tiles to work on different micro‑batches simultaneously.
    • Sharded data parallelism distributes optimizer states and gradients across all tiles, reducing memory pressure.
  4. Search space – The authors explored combinations of tensor‑parallel degree (TP), pipeline‑parallel degree (PP), and data‑parallel size (DP) for models ranging from 7 B to 175 B parameters.
  5. Metrics – Throughput (tokens/s), BF16 FLOPs utilization, weak‑scaling (workload per node held fixed as nodes are added) and strong‑scaling (total workload held fixed as nodes are added) efficiencies were recorded.
  6. Hyper‑parameter tuning – Learning‑rate schedules, micro‑batch sizes, and gradient‑accumulation steps were tuned to keep GPU memory utilization near the limit while maximizing compute.
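
The search over TP, PP, and DP described in step 4 can be pictured as enumerating the ways to factor the number of ranks into three degrees. The following is a minimal Python sketch of that idea, not the authors' tooling: the function name `enumerate_configs`, the power‑of‑two limit on TP, the rule that PP must divide the layer count, and the example values (128 ranks, 96 layers) are all illustrative assumptions.

```python
from typing import Iterator, NamedTuple


class ParallelConfig(NamedTuple):
    tp: int  # tensor-parallel degree
    pp: int  # pipeline-parallel degree
    dp: int  # data-parallel size


def enumerate_configs(world_size: int, num_layers: int, max_tp: int) -> Iterator[ParallelConfig]:
    """Yield every (TP, PP, DP) whose product equals world_size.

    Assumed constraints (illustrative, not taken from the paper):
    - TP is a power of two no larger than max_tp
    - PP must divide the number of transformer layers evenly
    """
    tp = 1
    while tp <= max_tp:
        if world_size % tp == 0:
            remaining = world_size // tp
            for pp in range(1, remaining + 1):
                if remaining % pp == 0 and num_layers % pp == 0:
                    yield ParallelConfig(tp, pp, remaining // pp)
        tp *= 2


if __name__ == "__main__":
    # 128 ranks and 96 layers are hypothetical example values.
    for cfg in enumerate_configs(world_size=128, num_layers=96, max_tp=8):
        print(cfg)
```

Each candidate would still have to be benchmarked; the enumeration only narrows the sweep to configurations that use every rank exactly once.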

Results & Findings

Model Size | Best Parallelism Mix (TP‑PP‑DP) | Per‑tile BF16 Utilization | Weak‑Scaling Efficiency | Strong‑Scaling Efficiency (128 nodes)
7 B | 2‑2‑8 | 8 % | 95 % | 88 %
13 B | 4‑2‑8 | 9 % | 94 % | 85 %
175 B | 8‑4‑4 | 10 % | 93 % | 82 %
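
For readers unfamiliar with the two scaling figures reported here, the sketch below shows how such efficiencies are commonly computed from raw measurements. It uses the standard textbook definitions; the function names and the sample numbers in the usage note are hypothetical and are not measurements from the paper.

```python
def weak_scaling_efficiency(base_throughput: float, base_nodes: int,
                            throughput: float, nodes: int) -> float:
    """Weak scaling: per-node workload is fixed, so ideal throughput
    grows linearly with the node count."""
    ideal_throughput = base_throughput * nodes / base_nodes
    return throughput / ideal_throughput


def strong_scaling_efficiency(base_time: float, base_nodes: int,
                              time: float, nodes: int) -> float:
    """Strong scaling: total workload is fixed, so ideal time-to-solution
    shrinks linearly with the node count."""
    ideal_time = base_time * base_nodes / nodes
    return ideal_time / time
```

As a made‑up example, a job that delivers 1000 tokens/s on 8 nodes and 14880 tokens/s on 128 nodes has a weak‑scaling efficiency of 14880 / 16000 = 0.93.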
  • Throughput: The 175 B model sustains ~10 % of the theoretical peak BF16 throughput per tile, stated as roughly 1.2 TFLOP/s per tile in practice.
  • Scaling: Weak scaling remains above 90 % up to 256 nodes, indicating the recipe’s robustness to larger allocations.
  • Memory: Sharded optimizer states reduce per‑GPU memory to ~30 GB, fitting comfortably within the 64 GB HBM budget.
  • Stability: No training failures or divergence were observed across the tested configurations, confirming the stability of the combined parallelism approach.
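
To see why sharding matters for the memory figure above, a back‑of‑the‑envelope estimate of model‑state memory can help. The sketch below assumes mixed‑precision Adam with the commonly cited 16 bytes per parameter (2 for BF16 weights, 2 for BF16 gradients, 12 for FP32 master weights, momentum, and variance) and one rank per tile. These are assumptions for illustration; the function name is hypothetical, activation memory is ignored, and this is not necessarily how the paper's number was obtained.

```python
def model_state_bytes_per_rank(num_params: float, tp: int, pp: int, dp: int,
                               shard_optimizer: bool = True,
                               shard_gradients: bool = False) -> float:
    """Rough per-rank bytes for weights, gradients, and Adam states.

    Assumes (illustratively) that TP and PP split the parameters evenly and
    that sharded quantities are divided evenly across the DP group.
    Activations, communication buffers, and fragmentation are ignored.
    """
    params_per_rank = num_params / (tp * pp)
    weight_bytes = 2.0                                   # BF16 weights
    grad_bytes = 2.0 / dp if shard_gradients else 2.0    # BF16 gradients
    optim_bytes = 12.0 / dp if shard_optimizer else 12.0 # FP32 master + Adam moments
    return params_per_rank * (weight_bytes + grad_bytes + optim_bytes)


if __name__ == "__main__":
    n = 175e9  # 175 B parameters with the 8-4-4 mix from the results
    for label, kwargs in [
        ("no sharding", dict(shard_optimizer=False, shard_gradients=False)),
        ("optimizer sharded", dict(shard_optimizer=True, shard_gradients=False)),
        ("optimizer + gradients sharded", dict(shard_optimizer=True, shard_gradients=True)),
    ]:
        gb = model_state_bytes_per_rank(n, tp=8, pp=4, dp=4, **kwargs) / 1e9
        print(f"{label}: {gb:.1f} GB per rank")
```

Under these assumptions the unsharded model state would need about 87.5 GB per rank, more than the 64 GB of HBM available, while sharding brings it into range; activations come on top of this estimate.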

Practical Implications

  • Accessible large‑scale training: Engineers can launch LLM training jobs of up to 175 B parameters on SMNG‑P2 without writing custom kernels or deep system‑level code.
  • Cost‑effective research: By achieving high utilization with an off‑the‑shelf stack, institutions can reduce the time‑to‑solution and operational overhead, making large‑scale model experimentation more affordable.
  • Blueprint for other HPC systems: The methodology (mixing TP/PP/DP, hyper‑parameter sweep, and using DeepSpeed/Megatron) can be adapted to other GPU‑centric supercomputers (e.g., NVIDIA H100 clusters, AMD Instinct systems).
  • Accelerated product development: Companies building domain‑specific LLMs (e.g., code assistants, scientific models) can prototype at scale faster, shortening the gap between research and production.
  • Benchmarking reference: The reported efficiencies serve as a performance baseline for future hardware or software improvements (e.g., newer Intel GPUs, optimized communication libraries).

Limitations & Future Work

  • Hardware specificity: Results are tied to Intel GPU Max 1550 tiles; performance on other accelerator architectures may differ.
  • Model size ceiling: The study stops at 175 B parameters; scaling to trillion‑parameter regimes may expose new bottlenecks (e.g., inter‑node bandwidth).
  • Software stack versioning: The recipe depends on particular versions of DeepSpeed and Megatron‑LM; future updates could require re‑validation.
  • Energy consumption: Power efficiency was not measured; future work could assess FLOPs‑per‑watt to guide greener training.
  • Automation: The current approach involves manual hyper‑parameter sweeps; integrating auto‑tuning frameworks could streamline the process further.

Authors

  • Ajay Navilarekal Rajgopal
  • Nikolai Solmsdorf

Paper Information

  • arXiv ID: 2605.07726v1
  • Categories: cs.DC
  • Published: May 8, 2026