[Paper] Training Time Prediction for Mixed Precision-based Distributed Training
Source: arXiv - 2604.16145v1
Overview
Training deep‑learning models across multiple GPUs or nodes is now routine, but estimating how long a job will run remains largely guesswork. This paper shows that the choice of floating‑point precision, and mixed‑precision training in particular, can swing total training time by more than a factor of two. Existing time‑prediction tools ignore precision entirely, leading to large errors. The authors present a precision‑aware predictor that keeps prediction error under 10 % even for mixed‑precision jobs.
Key Contributions
- Empirical quantification of how precision settings (FP32, FP16, BF16, mixed) affect distributed training time (up to 2.4× variation).
- Critical analysis of prior time‑prediction models, demonstrating up to 147 % MAPE when precision is omitted.
- Design of a new predictor that incorporates precision‑specific performance characteristics (compute, communication, memory bandwidth).
- Extensive evaluation on several popular models (ResNet‑50, BERT, GPT‑2) across multiple GPU clusters, achieving 9.8 % MAPE on average.
- Open‑source implementation (released under Apache 2.0) that can be plugged into existing job‑scheduling pipelines.
Methodology
1. Data Collection – The authors ran a large suite of experiments on GPU clusters (NVIDIA V100/A100), varying:
   - Model architecture (CNNs, Transformers)
   - Batch size and learning‑rate schedule
   - Precision mode (FP32, FP16, BF16, mixed)
   - Number of GPUs / nodes
   For each run they recorded per‑iteration compute time, communication latency, and overall epoch duration.
2. Feature Engineering – Besides the usual graph‑level features (FLOPs, parameter count), they added precision‑specific metrics:
   - Tensor‑core utilization ratio
   - Memory‑bandwidth savings from reduced precision
   - Overhead of loss‑scaling in mixed precision
3. Modeling Approach – A lightweight regression model (gradient‑boosted trees) was trained to map the feature vector to per‑iteration time. The model is hierarchical: a base predictor for compute, a separate predictor for communication, and a final aggregator that respects the precision‑dependent overlap between the two.
4. Validation – 5‑fold cross‑validation across all experiments, plus a hold‑out test on unseen models (e.g., Vision Transformers) to assess generalization.
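The aggregation step of the hierarchical model can be sketched as follows. The function name, signature, and numbers are illustrative assumptions, not the paper's implementation; the key idea is that an overlap factor captures how much communication hides behind compute, and that factor is precision‑dependent.

```python
# Minimal sketch of the hierarchical aggregation step: two separately
# predicted phase times are combined via an overlap factor that models how
# much communication hides behind compute. All names and numbers here are
# illustrative assumptions, not the paper's code.

def aggregate_iteration_time(t_compute, t_comm, overlap):
    """Combine per-iteration compute and communication predictions.

    overlap = 1.0 -> communication fully hidden behind compute;
    overlap = 0.0 -> the two phases run fully serialized.
    """
    hidden = min(t_comm, t_compute) * overlap
    return t_compute + t_comm - hidden

# e.g. 80 ms compute, 30 ms all-reduce, 60 % overlap under mixed precision
iter_time = aggregate_iteration_time(0.080, 0.030, 0.60)  # 0.092 s
```

With overlap fixed at zero this degenerates to a simple additive model, which is roughly what precision‑agnostic predictors assume; letting the overlap vary with the precision mode is what the final aggregator adds.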
The whole pipeline is packaged as a Python library with a simple API:

```python
from precision_time import predict_time

time_est = predict_time(
    model="resnet50",
    precision="mixed_fp16_fp32",
    gpus=8,
    batch_size=256,
)
```
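Since the paper records both per‑iteration and epoch‑level durations, a per‑iteration estimate converts to an epoch‑level one with simple arithmetic. The helper below is an illustration and not part of the library's API.

```python
# Converting a per-iteration time estimate into an epoch-level estimate.
# Illustrative helper, not the precision_time library API.
import math

def epoch_time(iter_time_s, dataset_size, batch_size):
    iterations = math.ceil(dataset_size / batch_size)  # partial final batch counts
    return iterations * iter_time_s

# ImageNet-scale example: 1,281,167 samples, batch 256, 0.12 s/iteration
# -> 5005 iterations, about 600.6 s per epoch
est = epoch_time(0.12, 1_281_167, 256)
```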
Results & Findings
| Setting | Precision‑agnostic baseline MAPE | Precision‑aware MAPE |
|---|---|---|
| FP32 only | 23.4 % | 8.1 % |
| FP16 only | 31.7 % | 9.3 % |
| Mixed‑precision (FP16/FP32) | 147.9 % | 9.8 % |
| Mixed‑precision (BF16/FP32) | 112.5 % | 10.2 % |
- Training time variation: Switching from FP32 to mixed precision cut wall‑clock time by a factor of ≈2.4 on the same hardware.
- Prediction robustness: The new predictor kept error under 10 % across all precision modes, batch sizes, and cluster scales (4–64 GPUs).
- Feature importance: Tensor‑core utilization and memory‑bandwidth savings were the top two contributors, confirming that precision directly influences both compute and communication phases.
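MAPE, the error metric in the table above, is the mean of per‑run absolute percentage errors between predicted and measured training times; a minimal reference implementation:

```python
# MAPE (mean absolute percentage error) over predicted vs. measured times.
def mape(predicted, actual):
    errors = [abs(p - a) / a for p, a in zip(predicted, actual)]
    return 100.0 * sum(errors) / len(errors)

# Two runs, each predicted within 5 % of the measured epoch time (seconds):
# mape([95.0, 210.0], [100.0, 200.0]) is ~5.0
```

A sub‑10 % MAPE therefore means predictions land, on average, within one tenth of the measured run time.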
Practical Implications
- Cost‑aware scheduling – Cloud platforms can now feed accurate time estimates into spot‑instance bidding or budget‑capped jobs, avoiding over‑provisioning.
- Auto‑ML pipelines – Hyper‑parameter search frameworks can factor in precision choices when estimating total experiment runtime, leading to smarter early‑stopping decisions.
- Resource allocation tools – Cluster managers (e.g., Slurm, Kubernetes) can schedule mixed‑precision jobs on nodes with Tensor‑core‑enabled GPUs, maximizing throughput.
- Developer tooling – The open‑source library can be integrated into popular training scripts (PyTorch Lightning, DeepSpeed) to give developers a “time‑to‑completion” preview before launching large runs.
- Energy efficiency – By accurately predicting the speed‑up from mixed precision, organizations can quantify the energy savings and carbon‑footprint reduction of precision‑aware training.
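As a sketch of the cost‑aware‑scheduling use case, a scheduler could act on the predictor's output as below. The hourly price, the stub runtimes, and all function names are invented for illustration; the stub stands in for a real call into the released library.

```python
# Hypothetical cost-aware precision selection: estimated cloud cost is
# predicted hours x GPU count x hourly price. The predictor is a stub with
# made-up runtimes standing in for the paper's released library.

HOURLY_PRICE_PER_GPU = 2.50  # assumed USD/hour, illustration only

def predicted_hours(precision):
    # Stub: replace with a real call to the time predictor.
    return {"fp32": 10.0, "mixed_fp16_fp32": 4.2}[precision]

def cheapest_precision(options, gpus):
    cost = {p: predicted_hours(p) * gpus * HOURLY_PRICE_PER_GPU for p in options}
    return min(cost, key=cost.get)

choice = cheapest_precision(["fp32", "mixed_fp16_fp32"], gpus=8)
```

Under these assumed numbers the mixed‑precision run wins on cost despite identical hardware, which is exactly the kind of trade‑off an accurate time estimate makes visible before launch.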
Limitations & Future Work
- Hardware scope – Experiments were limited to NVIDIA V100/A100 GPUs; extending to AMD GPUs, TPUs, or upcoming Hopper GPUs may require retraining the model.
- Dynamic precision – The predictor assumes a static precision setting per run; future work could handle dynamic precision schedules (e.g., gradually switching from FP16 to FP32).
- Network topology – Only standard Ethernet/InfiniBand fabrics were examined; exotic topologies (e.g., NVLink‑based clusters) could affect communication modeling.
- Model diversity – While the suite covered CNNs and Transformers, niche architectures (graph neural networks, diffusion models) were not evaluated.
The authors plan to broaden the dataset, incorporate real‑time profiling hooks for online prediction adjustments, and explore reinforcement‑learning‑based precision scheduling that jointly optimizes speed, accuracy, and cost.
Authors
- Minchul Kang
- Changyong Shin
- Jinwoo Jeong
- Hyunho Lee
- Younghun Go
- Gyeongmin Kim
- Gyeongsik Yang
- Chuck Yoo
Paper Information
- arXiv ID: 2604.16145v1
- Categories: cs.LG, cs.AI, cs.DC, cs.PF
- Published: April 17, 2026
- PDF: Download PDF