[Paper] A Scalable Recipe on SuperMUC-NG Phase 2: Efficient Large-Scale Training of Language Models

Published: May 8, 2026 at 09:32 AM EDT

Source: arXiv - 2605.07726v1

Overview

The paper presents a “recipe” for training large language models (LLMs) of up to 175 billion parameters on the SuperMUC‑NG Phase 2 (SMNG‑P2) supercomputer, which is equipped with Intel Data Center GPU Max 1550 accelerators. By combining several parallel‑training techniques in a reproducible, off‑the‑shelf software stack, the authors demonstrate that high‑throughput, efficient LLM training is achievable on modern HPC systems without custom code modifications.

Key Contributions

Comprehensive parallelism recipe: Integration of tensor parallelism, pipeline parallelism, and sharded data parallelism tailored for SMNG‑P2.
Empirical performance model: Systematic hyper‑parameter tuning and measurement of the interaction between the three parallelism dimensions.
High efficiency on real hardware: Achieved 10 % of the theoretical peak BF16 throughput per tile for a 175 B model, 93 % weak‑scaling efficiency, and 82 % strong‑scaling efficiency on 128 nodes.
Reproducible workflow: Utilized only publicly available software (e.g., PyTorch, DeepSpeed, Megatron‑LM) with default distributions—no custom patches required.
Scalable blueprint: Provides a step‑by‑step guide that can be directly applied by other researchers or engineers on the same HPC platform.

Methodology

Hardware baseline – The experiments run on SMNG‑P2, a cluster of Intel Data Center GPU Max 1550 accelerators (each GPU comprises 2 tiles, with 64 GB of HBM per tile).
Software stack – Standard PyTorch 2.x, DeepSpeed 0.12, and Megatron‑LM are used. The stack already supports the three parallelism primitives.
Parallelism mix
- Tensor parallelism splits each matrix multiplication across tiles within a node.
- Pipeline parallelism partitions the model’s layers into stages, allowing different tiles to work on different micro‑batches simultaneously.
- Sharded data parallelism distributes optimizer states and gradients across all tiles, reducing memory pressure.
Search space – The authors explored combinations of tensor‑parallel degree (TP), pipeline‑parallel degree (PP), and data‑parallel size (DP) for models ranging from 7 B to 175 B parameters.
Metrics – Throughput (tokens/s), BF16 FLOPs utilization, weak‑scaling (workload per node held constant as nodes are added) and strong‑scaling (total workload held fixed as nodes are added) efficiencies were recorded.
Hyper‑parameter tuning – Learning‑rate schedules, micro‑batch sizes, and gradient‑accumulation steps were tuned to keep GPU memory utilization near the limit while maximizing compute.
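
The metrics listed above can be written down as small helper functions. The following is a minimal sketch, not the authors' measurement code: the function names are hypothetical, the utilization estimate assumes the common approximation of about 6 FLOPs per parameter per token for a forward and backward pass (no activation recomputation), and the peak FLOP/s per tile is left as an input rather than taken from the paper.

```python
def flops_utilization(tokens_per_s, n_params, n_tiles, peak_flops_per_tile):
    """Fraction of peak achieved, assuming ~6 FLOPs per parameter per token."""
    achieved = 6.0 * n_params * tokens_per_s  # assumption: forward + backward only
    return achieved / (n_tiles * peak_flops_per_tile)


def weak_scaling_efficiency(tput_base, tput_scaled, nodes_base, nodes_scaled):
    """Per-node workload is constant, so ideal throughput grows linearly with nodes."""
    ideal_tput = tput_base * (nodes_scaled / nodes_base)
    return tput_scaled / ideal_tput


def strong_scaling_efficiency(time_base, time_scaled, nodes_base, nodes_scaled):
    """Total workload is fixed, so ideal runtime shrinks linearly with nodes."""
    ideal_time = time_base * (nodes_base / nodes_scaled)
    return ideal_time / time_scaled


if __name__ == "__main__":
    # Illustrative inputs only; these are not measurements from the paper.
    print(weak_scaling_efficiency(1000.0, 7440.0, 1, 8))
    print(strong_scaling_efficiency(820.0, 125.0, 16, 128))
```

Both efficiency functions return 1.0 for perfect scaling, so values such as 0.93 or 0.82 read directly as the percentages quoted in this summary.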

Results & Findings

Model Size	Best Parallelism Mix (TP‑PP‑DP)	Per‑tile BF16 Utilization	Weak‑Scaling Efficiency	Strong‑Scaling Efficiency (128 nodes)
7 B	2‑2‑8	8 %	95 %	88 %
13 B	4‑2‑8	9 %	94 %	85 %
175 B	8‑4‑4	10 %	93 %	82 %
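
Each mix in the table implies a minimum tile count of TP × PP × DP. The sketch below works through that arithmetic and enumerates alternative mixes for a given tile count. It assumes 8 tiles per node (4 GPUs with 2 tiles each), which is not stated in this summary, and restricts TP and PP to powers of two; the function names are hypothetical.

```python
TILES_PER_NODE = 8  # assumption: 4 GPUs per node x 2 tiles per GPU


def tiles_needed(tp, pp, dp):
    """One model replica spans tp * pp tiles; dp replicas run side by side."""
    return tp * pp * dp


def candidate_mixes(n_tiles, max_tp=TILES_PER_NODE):
    """List (TP, PP, DP) with power-of-two TP and PP whose product equals n_tiles."""
    mixes = []
    tp = 1
    while tp <= max_tp:
        pp = 1
        while tp * pp <= n_tiles:
            if n_tiles % (tp * pp) == 0:
                mixes.append((tp, pp, n_tiles // (tp * pp)))
            pp *= 2
        tp *= 2
    return mixes


# Mixes copied from the table above.
table_mixes = {"7 B": (2, 2, 8), "13 B": (4, 2, 8), "175 B": (8, 4, 4)}

for name, mix in table_mixes.items():
    tiles = tiles_needed(*mix)
    print(name, mix, tiles, "tiles,", tiles // TILES_PER_NODE, "nodes")
```

Under the 8-tiles-per-node assumption, the three mixes occupy 32, 64, and 128 tiles, or 4, 8, and 16 nodes. Capping TP at the tiles in one node reflects the usual practice of keeping tensor-parallel traffic on the fastest interconnect.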

Throughput: Training the 175 B model sustains ~10 % of the theoretical peak BF16 throughput per tile, translating to roughly 1.2 TFLOP/s per tile in practice.
Scaling: Weak scaling remains above 90 % up to 256 nodes, indicating the recipe’s robustness to larger allocations.
Memory: Sharded optimizer states reduce per‑tile memory to ~30 GB, fitting comfortably within the 64 GB of HBM available per tile.
Stability: No training failures or divergence were observed across the tested configurations, confirming the stability of the combined parallelism approach.
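
The memory figure above can be sanity-checked with a back-of-envelope estimate of model state per tile. This is a rough sketch under stated assumptions, not the paper's accounting: it assumes mixed-precision Adam with 2 bytes per parameter for BF16 weights, 2 bytes for gradients, and 12 bytes for FP32 optimizer state, an even split of parameters across TP × PP, and it ignores activations and buffers entirely. The function name is hypothetical.

```python
def model_state_gb(n_params, tp, pp, dp, shard_grads=True):
    """Estimate per-tile model state in GB (10^9 bytes), excluding activations."""
    params_per_rank = n_params / (tp * pp)        # assumption: even split over TP x PP
    weights = 2 * params_per_rank                 # BF16 weights, replicated across DP
    grads = 2 * params_per_rank / (dp if shard_grads else 1)
    optim = 12 * params_per_rank / dp             # FP32 master copy + two Adam moments
    return (weights + grads + optim) / 1e9


if __name__ == "__main__":
    # 175 B model with the 8-4-4 mix from the results table.
    print(round(model_state_gb(175e9, 8, 4, 4, shard_grads=True), 1))   # 30.1
    print(round(model_state_gb(175e9, 8, 4, 4, shard_grads=False), 1))  # 38.3
```

With both gradients and optimizer states sharded, the estimate lands near the ~30 GB quoted above, but that agreement may be coincidental, since activation memory depends on micro-batch size and sequence length, neither of which is given here.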

Practical Implications

Accessible large‑scale training: Engineers can launch LLM training jobs of up to 175 billion parameters on SMNG‑P2 without writing custom kernels or deep system‑level code.
Cost‑effective research: By achieving high utilization with an off‑the‑shelf stack, institutions can reduce the time‑to‑solution and operational overhead, making large‑scale model experimentation more affordable.
Blueprint for other HPC systems: The methodology (mixing TP/PP/DP, hyper‑parameter sweep, and using DeepSpeed/Megatron) can be adapted to other GPU‑centric supercomputers (e.g., NVIDIA H100 clusters, AMD Instinct systems).
Accelerated product development: Companies building domain‑specific LLMs (e.g., code assistants, scientific models) can prototype at scale faster, shortening the gap between research and production.
Benchmarking reference: The reported efficiencies serve as a performance baseline for future hardware or software improvements (e.g., newer Intel GPUs, optimized communication libraries).

Limitations & Future Work

Hardware specificity: Results are tied to Intel GPU Max 1550 tiles; performance on other accelerator architectures may differ.
Model size ceiling: The study stops at 175 B parameters; scaling to trillion‑parameter regimes may expose new bottlenecks (e.g., inter‑node bandwidth).
Software stack versioning: The recipe depends on particular versions of DeepSpeed and Megatron‑LM; future updates could require re‑validation.
Energy consumption: Power efficiency was not measured; future work could assess FLOPs‑per‑watt to guide greener training.
Automation: The current approach involves manual hyper‑parameter sweeps; integrating auto‑tuning frameworks could streamline the process further.

Authors

Ajay Navilarekal Rajgopal
Nikolai Solmsdorf

Paper Information

arXiv ID: 2605.07726v1
Categories: cs.DC
Published: May 8, 2026
