[Paper] Training Time Prediction for Mixed Precision-based Distributed Training
Source: arXiv - 2604.16145v1
Overview
Training deep‑learning models across multiple GPUs or nodes is now routine, but estimating how long a job will run remains largely guesswork. This paper shows that the choice of floating‑point precision, and mixed‑precision training in particular, can swing total training time by more than a factor of two. Existing time‑prediction tools ignore precision entirely, leading to large errors. The authors present a precision‑aware predictor that keeps prediction error under 10 % even for mixed‑precision jobs.
Key Contributions
- Empirical quantification of how precision settings (FP32, FP16, BF16, mixed) affect distributed training time (up to 2.4× variation).
- Critical analysis of prior time‑prediction models, demonstrating up to 147 % MAPE when precision is omitted.
- Design of a new predictor that incorporates precision‑specific performance characteristics (compute, communication, memory bandwidth).
- Extensive evaluation on several popular models (ResNet‑50, BERT, GPT‑2) across multiple GPU clusters, achieving 9.8 % MAPE on average.
- Open‑source implementation (released under Apache 2.0) that can be plugged into existing job‑scheduling pipelines.
Methodology
1. Data Collection – The authors ran a large suite of experiments on GPU clusters (NVIDIA V100/A100), varying:
   - Model architecture (CNNs, Transformers)
   - Batch size and learning‑rate schedule
   - Precision mode (FP32, FP16, BF16, mixed)
   - Number of GPUs / nodes
   For each run they recorded per‑iteration compute time, communication latency, and overall epoch duration.
2. Feature Engineering – Besides the usual graph‑level features (FLOPs, parameter count), they added precision‑specific metrics:
   - Tensor‑core utilization ratio
   - Memory‑bandwidth savings from reduced precision
   - Overhead of loss‑scaling in mixed precision
3. Modeling Approach – A lightweight regression model (gradient‑boosted trees) was trained to map the feature vector to per‑iteration time. The model is hierarchical: a base predictor for compute, a separate predictor for communication, and a final aggregator that respects the precision‑dependent overlap between the two.
4. Validation – 5‑fold cross‑validation across all experiments, plus a hold‑out test on unseen models (e.g., Vision Transformers) to assess generalization.
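The aggregation step of the hierarchical model can be sketched as follows. The function name, signature, and numbers are illustrative assumptions, not the paper's implementation; the key idea is that an overlap factor captures how much communication hides behind compute, and that factor is precision‑dependent.

```python
# Minimal sketch of the hierarchical aggregation step: two separately
# predicted phase times are combined via an overlap factor that models how
# much communication hides behind compute. All names and numbers here are
# illustrative assumptions, not the paper's code.

def aggregate_iteration_time(t_compute, t_comm, overlap):
    """Combine per-iteration compute and communication predictions.

    overlap = 1.0 -> communication fully hidden behind compute;
    overlap = 0.0 -> the two phases run fully serialized.
    """
    hidden = min(t_comm, t_compute) * overlap
    return t_compute + t_comm - hidden

# e.g. 80 ms compute, 30 ms all-reduce, 60 % overlap under mixed precision
iter_time = aggregate_iteration_time(0.080, 0.030, 0.60)  # 0.092 s
```

With overlap fixed at zero this degenerates to a simple additive model, which is roughly what precision‑agnostic predictors assume; letting the overlap vary with the precision mode is what the final aggregator adds.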
The whole pipeline is packaged as a Python library with a simple API:

```python
from precision_time import predict_time

time_est = predict_time(
    model="resnet50",
    precision="mixed_fp16_fp32",
    gpus=8,
    batch_size=256,
)
```
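Since the paper records both per‑iteration and epoch‑level durations, a per‑iteration estimate converts to an epoch‑level one with simple arithmetic. The helper below is an illustration and not part of the library's API.

```python
# Converting a per-iteration time estimate into an epoch-level estimate.
# Illustrative helper, not the precision_time library API.
import math

def epoch_time(iter_time_s, dataset_size, batch_size):
    iterations = math.ceil(dataset_size / batch_size)  # partial final batch counts
    return iterations * iter_time_s

# ImageNet-scale example: 1,281,167 samples, batch 256, 0.12 s/iteration
# -> 5005 iterations, about 600.6 s per epoch
est = epoch_time(0.12, 1_281_167, 256)
```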
Results & Findings
| Setting | Precision‑agnostic baseline MAPE | Precision‑aware MAPE |
|---|---|---|
| FP32 only | 23.4 % | 8.1 % |
| FP16 only | 31.7 % | 9.3 % |
| Mixed‑precision (FP16/FP32) | 147.9 % | 9.8 % |
| Mixed‑precision (BF16/FP32) | 112.5 % | 10.2 % |
- Training time variation: Switching from FP32 to mixed precision cut wall‑clock time by a factor of ≈2.4 on the same hardware.
- Prediction robustness: The new predictor kept error under 10 % across all precision modes, batch sizes, and cluster scales (4–64 GPUs).
- Feature importance: Tensor‑core utilization and memory‑bandwidth savings were the top two contributors, confirming that precision directly influences both compute and communication phases.
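MAPE, the error metric in the table above, is the mean of per‑run absolute percentage errors between predicted and measured training times; a minimal reference implementation:

```python
# MAPE (mean absolute percentage error) over predicted vs. measured times.
def mape(predicted, actual):
    errors = [abs(p - a) / a for p, a in zip(predicted, actual)]
    return 100.0 * sum(errors) / len(errors)

# Two runs, each predicted within 5 % of the measured epoch time (seconds):
# mape([95.0, 210.0], [100.0, 200.0]) is ~5.0
```

A sub‑10 % MAPE therefore means predictions land, on average, within one tenth of the measured run time.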
Practical Implications
- Cost‑aware scheduling – Cloud platforms can now feed accurate time estimates into spot‑instance bidding or budget‑capped jobs, avoiding over‑provisioning.
- Auto‑ML pipelines – Hyper‑parameter search frameworks can factor in precision choices when estimating total experiment runtime, leading to smarter early‑stopping decisions.
- Resource allocation tools – Cluster managers (e.g., Slurm, Kubernetes) can schedule mixed‑precision jobs on nodes with Tensor‑core‑enabled GPUs, maximizing throughput.
- Developer tooling – The open‑source library can be integrated into popular training scripts (PyTorch Lightning, DeepSpeed) to give developers a “time‑to‑completion” preview before launching large runs.
- Energy efficiency – By accurately predicting the speed‑up from mixed precision, organizations can quantify the energy savings and carbon‑footprint reduction of precision‑aware training.
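As a sketch of the cost‑aware‑scheduling use case, a scheduler could act on the predictor's output as below. The hourly price, the stub runtimes, and all function names are invented for illustration; the stub stands in for a real call into the released library.

```python
# Hypothetical cost-aware precision selection: estimated cloud cost is
# predicted hours x GPU count x hourly price. The predictor is a stub with
# made-up runtimes standing in for the paper's released library.

HOURLY_PRICE_PER_GPU = 2.50  # assumed USD/hour, illustration only

def predicted_hours(precision):
    # Stub: replace with a real call to the time predictor.
    return {"fp32": 10.0, "mixed_fp16_fp32": 4.2}[precision]

def cheapest_precision(options, gpus):
    cost = {p: predicted_hours(p) * gpus * HOURLY_PRICE_PER_GPU for p in options}
    return min(cost, key=cost.get)

choice = cheapest_precision(["fp32", "mixed_fp16_fp32"], gpus=8)
```

Under these assumed numbers the mixed‑precision run wins on cost despite identical hardware, which is exactly the kind of trade‑off an accurate time estimate makes visible before launch.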
Limitations & Future Work
- Hardware scope – Experiments were limited to NVIDIA V100/A100 GPUs; extending to AMD GPUs, TPUs, or upcoming Hopper GPUs may require retraining the model.
- Dynamic precision – The predictor assumes a static precision setting per run; future work could handle dynamic precision schedules (e.g., gradually switching from FP16 to FP32).
- Network topology – Only standard Ethernet/InfiniBand fabrics were examined; exotic topologies (e.g., NVLink‑based clusters) could affect communication modeling.
- Model diversity – While the suite covered CNNs and Transformers, niche architectures (graph neural networks, diffusion models) were not evaluated.
The authors plan to broaden the dataset, incorporate real‑time profiling hooks for online prediction adjustments, and explore reinforcement‑learning‑based precision scheduling that jointly optimizes speed, accuracy, and cost.
Authors
- Minchul Kang
- Changyong Shin
- Jinwoo Jeong
- Hyunho Lee
- Younghun Go
- Gyeongmin Kim
- Gyeongsik Yang
- Chuck Yoo
Paper Information
- arXiv ID: 2604.16145v1
- Categories: cs.LG, cs.AI, cs.DC, cs.PF
- Published: April 17, 2026
- PDF: Download PDF