[Paper] A Scalable Recipe on SuperMUC-NG Phase 2: Efficient Large-Scale Training of Language Models
Source: arXiv - 2605.07726v1
Overview
The paper presents a “recipe” for training very large language models (LLMs) of up to 175 billion parameters on the SuperMUC‑NG Phase 2 (SMNG‑P2) supercomputer, which is equipped with Intel Data Center GPU Max 1550 accelerators. By combining several parallel‑training techniques in a reproducible, off‑the‑shelf software stack, the authors show that efficient, high‑throughput LLM training is achievable on a modern GPU‑accelerated HPC system without custom code modifications.
Key Contributions
- Comprehensive parallelism recipe: Integration of tensor parallelism, pipeline parallelism, and sharded data parallelism tailored for SMNG‑P2.
- Empirical performance model: Systematic hyper‑parameter tuning and measurement of the interaction between the three parallelism dimensions.
- High efficiency on real hardware: Sustains about 10 % of the theoretical peak BF16 throughput per tile for a 175 B model, with 93 % weak‑scaling efficiency and 82 % strong‑scaling efficiency on 128 nodes.
- Reproducible workflow: Uses only publicly available software (e.g., PyTorch, DeepSpeed, Megatron‑LM) in default distributions, with no custom patches required.
- Scalable blueprint: Provides a step‑by‑step guide that can be directly applied by other researchers or engineers on the same HPC platform.
Methodology
- Hardware baseline – The experiments run on SMNG‑P2, a cluster of Intel Data Center GPU Max 1550 accelerators (each GPU = 2 tiles, 64 GB HBM per tile).
- Software stack – Standard PyTorch 2.x, DeepSpeed 0.12, and Megatron‑LM are used. The stack already supports the three parallelism primitives.
- Parallelism mix
  - Tensor parallelism splits each matrix multiplication across multiple tiles.
  - Pipeline parallelism partitions the model’s layers into stages, allowing different tiles to work on different micro‑batches simultaneously.
  - Sharded data parallelism distributes optimizer states and gradients across data‑parallel ranks, reducing memory pressure.
- Search space – The authors explored combinations of tensor‑parallel degree (TP), pipeline‑parallel degree (PP), and data‑parallel size (DP) for models ranging from 7 B to 175 B parameters.
- Metrics – Throughput (tokens/s), BF16 FLOPs utilization, weak‑scaling efficiency (workload per node held constant as nodes are added), and strong‑scaling efficiency (total workload held constant as nodes are added) were recorded.
- Hyper‑parameter tuning – Learning‑rate schedules, micro‑batch sizes, and gradient‑accumulation steps were tuned to keep GPU memory utilization near the limit while maximizing compute.
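The two scaling metrics above can be made concrete with a small sketch. The function names and all measurement values below are hypothetical, chosen only to illustrate the arithmetic; they are not taken from the paper.

```python
def weak_scaling_efficiency(base_throughput, base_nodes, throughput, nodes):
    """Weak scaling: per-node workload is fixed, so ideal throughput grows linearly with nodes."""
    ideal_throughput = base_throughput * (nodes / base_nodes)
    return throughput / ideal_throughput


def strong_scaling_efficiency(base_time, base_nodes, time, nodes):
    """Strong scaling: total workload is fixed, so ideal time shrinks linearly with nodes."""
    ideal_time = base_time * (base_nodes / nodes)
    return ideal_time / time


# Hypothetical measurements (throughput in tokens/s, time in seconds).
weak = weak_scaling_efficiency(base_throughput=1000.0, base_nodes=8,
                               throughput=14880.0, nodes=128)
strong = strong_scaling_efficiency(base_time=160.0, base_nodes=8,
                                   time=12.5, nodes=128)
print(f"weak: {weak:.2f}, strong: {strong:.2f}")  # weak: 0.93, strong: 0.80
```

Both functions compare a measurement against linear scaling from a baseline run, so the choice of baseline node count affects the reported efficiency.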
Results & Findings
| Model Size | Best Parallelism Mix (TP‑PP‑DP) | Per‑tile BF16 Utilization | Weak‑Scaling Efficiency | Strong‑Scaling Efficiency (128 nodes) |
|---|---|---|---|---|
| 7 B | 2‑2‑8 | 8 % | 95 % | 88 % |
| 13 B | 4‑2‑8 | 9 % | 94 % | 85 % |
| 175 B | 8‑4‑4 | 10 % | 93 % | 82 % |
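As a rough illustration of the search space behind the table, the sketch below enumerates every (TP, PP, DP) triple whose product equals a given number of worker ranks, and computes the rank count implied by each mix in the table. It assumes the three degrees multiply to the total rank count, which is the usual convention in Megatron‑style 3D parallelism; the helper name `parallel_configs` and the caps `max_tp` and `max_pp` are hypothetical.

```python
from itertools import product


def parallel_configs(num_ranks, max_tp=8, max_pp=8):
    """Return every (tp, pp, dp) with tp * pp * dp == num_ranks.

    max_tp and max_pp are illustrative caps, not values from the paper.
    """
    configs = []
    for tp, pp in product(range(1, max_tp + 1), range(1, max_pp + 1)):
        if num_ranks % (tp * pp) == 0:
            configs.append((tp, pp, num_ranks // (tp * pp)))
    return configs


# Rank counts implied by the best mixes listed in the table.
table_mixes = {"7B": (2, 2, 8), "13B": (4, 2, 8), "175B": (8, 4, 4)}
ranks = {name: tp * pp * dp for name, (tp, pp, dp) in table_mixes.items()}
print(ranks)  # {'7B': 32, '13B': 64, '175B': 128}

candidates = parallel_configs(128)
print(len(candidates), (8, 4, 4) in candidates)
```

A real sweep would prune this list further, for example by requiring the pipeline degree to divide the layer count and by discarding mixes that do not fit in memory.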
- Throughput: The 175 B model sustains ~10 % of the theoretical peak BF16 throughput per tile, translating to roughly 1.2 TFLOP/s per tile in practice.
- Scaling: Weak scaling remains above 90 % up to 256 nodes, indicating the recipe’s robustness to larger allocations.
- Memory: Sharded optimizer states reduce per‑tile memory to ~30 GB, fitting comfortably within the 64 GB HBM budget.
- Stability: No training failures or divergence were observed across the tested configurations, confirming the stability of the combined parallelism approach.
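To see why sharding matters for the memory figure, here is a back‑of‑the‑envelope estimate of model‑state memory per rank. It is a sketch under stated assumptions, not the paper's accounting: it assumes mixed‑precision Adam at 16 bytes per parameter (2 for BF16 weights, 2 for BF16 gradients, 12 for FP32 master weights and two Adam moments), assumes gradients and optimizer states are sharded across the data‑parallel ranks, and ignores activations and buffers. The function name is hypothetical.

```python
def per_rank_model_state_gb(params, tp, pp, dp, shard_grads=True):
    """Rough model-state memory per rank in GB (10**9 bytes), excluding activations.

    Assumed byte counts: 2 B/param BF16 weights, 2 B/param BF16 gradients,
    12 B/param FP32 master weights plus two Adam moments.
    """
    local_params = params / (tp * pp)        # parameters held by one model-parallel shard
    weights = 2 * local_params               # replicated across data-parallel ranks
    grads = 2 * local_params / (dp if shard_grads else 1)
    optimizer = 12 * local_params / dp       # optimizer states sharded over data-parallel ranks
    return (weights + grads + optimizer) / 1e9


estimate = per_rank_model_state_gb(175e9, tp=8, pp=4, dp=4)
unsharded = per_rank_model_state_gb(175e9, tp=8, pp=4, dp=1)
print(f"sharded: {estimate:.1f} GB, unsharded: {unsharded:.1f} GB")
# sharded: 30.1 GB, unsharded: 87.5 GB
```

Under these assumptions the unsharded state would exceed 64 GB, while the sharded estimate falls in the same range as the figure quoted above. The agreement may be coincidental, since activation memory and the exact sharding stage are not covered by this sketch.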
Practical Implications
- Accessible large‑scale training: Engineers can launch LLM training jobs of up to 175 B parameters on SMNG‑P2 without writing custom kernels or deep system‑level code.
- Cost‑effective research: By achieving high utilization with an off‑the‑shelf stack, institutions can reduce the time‑to‑solution and operational overhead, making large‑scale model experimentation more affordable.
- Blueprint for other HPC systems: The methodology (mixing TP/PP/DP, hyper‑parameter sweep, and using DeepSpeed/Megatron) can be adapted to other GPU‑centric supercomputers (e.g., NVIDIA H100 clusters, AMD Instinct systems).
- Accelerated product development: Companies building domain‑specific LLMs (e.g., code assistants, scientific models) can prototype at scale faster, shortening the gap between research and production.
- Benchmarking reference: The reported efficiencies serve as a performance baseline for future hardware or software improvements (e.g., newer Intel GPUs, optimized communication libraries).
Limitations & Future Work
- Hardware specificity: Results are tied to Intel GPU Max 1550 tiles; performance on other accelerator architectures may differ.
- Model size ceiling: The study stops at 175 B parameters; scaling to trillion‑parameter regimes may expose new bottlenecks (e.g., inter‑node bandwidth).
- Software stack versioning: The recipe depends on particular versions of DeepSpeed and Megatron‑LM; future updates could require re‑validation.
- Energy consumption: Power efficiency was not measured; future work could assess FLOPs‑per‑watt to guide greener training.
- Automation: The current approach involves manual hyper‑parameter sweeps; integrating auto‑tuning frameworks could streamline the process further.
Authors
- Ajay Navilarekal Rajgopal
- Nikolai Solmsdorf
Paper Information
- arXiv ID: 2605.07726v1
- Categories: cs.DC
- Published: May 8, 2026