[Paper] GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations
Source: arXiv:2602.17817v1
Overview
The paper investigates how to reliably predict GPU memory consumption and utilization when training deep‑learning models—a prerequisite for safely collocating multiple training jobs on the same GPU. By systematically comparing three families of estimators—analytical formulas, CPU‑side profiling tools, and lightweight machine‑learning models—the authors expose the trade‑offs developers face when trying to maximize GPU throughput without triggering out‑of‑memory (OOM) crashes.
Key Contributions
- Comprehensive benchmark of representative memory estimators (Horus, PyTorch FakeTensor, and a custom ML‑based predictor) across a synthetic suite of MLPs, CNNs, and Transformers.
- Cross‑hardware analysis showing how identical models can have dramatically different memory footprints on different GPU generations.
- First systematic study of GPU‑utilization estimation for training workloads, highlighting why utilization is not a simple additive metric.
- Open‑source artifacts: the synthetic model dataset, training scripts for the ML estimators, and evaluation pipelines released to the community.
- Practical guidelines for choosing an estimator based on accuracy, latency, and integration effort.
Methodology
Synthetic Model Corpus
- Generated a large, controlled set of neural networks varying in depth, width, layer types, and hyper‑parameters; a minimal generator sketch follows this list.
- Covered three architecture families:
  - Multi‑layer Perceptrons (MLPs)
  - Convolutional Neural Networks (CNNs)
  - Transformer‑style models
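The corpus generator itself is not detailed in this summary. As a minimal sketch, assuming a PyTorch‑based generator with illustrative width and depth ranges (all names and defaults below are assumptions, not the paper's code), a random MLP variant could be produced like this:

```python
# Minimal sketch of a synthetic-corpus generator (illustrative only; the
# function name, width/depth ranges, and defaults are assumptions, not the
# paper's actual generator).
import random
import torch.nn as nn

def random_mlp(max_depth: int = 8, in_features: int = 512, num_classes: int = 10) -> nn.Sequential:
    """Build an MLP with randomly sampled depth and per-layer widths."""
    depth = random.randint(1, max_depth)
    widths = [random.choice([64, 128, 256, 512, 1024, 2048, 4096]) for _ in range(depth)]
    layers, prev = [], in_features
    for w in widths:
        layers += [nn.Linear(prev, w), nn.ReLU()]
        prev = w
    layers.append(nn.Linear(prev, num_classes))
    return nn.Sequential(*layers)

# A corpus is then just many such models plus their metadata
# (depth, widths, parameter counts), which later feeds the ML estimators.
corpus = [random_mlp() for _ in range(100)]
```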
Estimator Selection
- Evaluated three representative approaches:
  - Analytical – Horus: computes memory from a static graph and known per‑operator costs (a simplified, back‑of‑the‑envelope variant of this idea is sketched after this list).
  - CPU‑side profiling – PyTorch FakeTensor: runs a lightweight forward pass on the host to infer memory needs.
  - ML‑based – small neural networks (both MLP and Transformer back‑ends) trained on the synthetic corpus to predict memory from model metadata (layer counts, tensor shapes, etc.).
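To make the analytical family concrete, the sketch below gives a back‑of‑the‑envelope training‑memory estimate for an fp32 model trained with Adam. It is a simplified illustration, not the Horus formula; in particular, the activation term is a crude caller‑supplied assumption.

```python
# Simplified analytical estimate (NOT Horus): sum weights, gradients, Adam
# optimizer states, and a caller-supplied activation term, all in bytes.
import torch.nn as nn

def analytical_memory_bytes(model: nn.Module, batch_size: int,
                            activation_elems_per_sample: int,
                            bytes_per_elem: int = 4) -> int:
    n_params = sum(p.numel() for p in model.parameters())
    weights = n_params * bytes_per_elem           # fp32 parameters
    grads = n_params * bytes_per_elem             # one gradient per parameter
    adam_states = 2 * n_params * bytes_per_elem   # exp_avg + exp_avg_sq
    activations = batch_size * activation_elems_per_sample * bytes_per_elem
    return weights + grads + adam_states + activations

model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 10))
# The activation term is itself an estimate; here we crudely use the widest
# hidden layer as elements-per-sample.
print(analytical_memory_bytes(model, batch_size=64, activation_elems_per_sample=2048))
```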
Cross‑GPU Experiments
- Ran each estimator on three GPU architectures (e.g., NVIDIA V100, A100, and RTX 4090) to capture hardware‑specific variance.
Utilization Measurement
- Collected GPU‑utilization traces (SM occupancy, memory bandwidth, and compute utilization) for the same models; a minimal polling sketch follows this list.
- Attempted to predict these utilization metrics using the same ML framework.
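The paper's exact tracing tooling is not named in this summary. One common option is polling NVML via the pynvml package, sketched below; note that NVML's utilization percentages are coarser than true SM occupancy or bandwidth counters.

```python
# Hedged sketch: poll coarse GPU utilization while a training job runs.
# pynvml's percentages are coarser than real SM-occupancy / bandwidth counters.
import time
import pynvml

def poll_utilization(seconds: float = 10.0, interval: float = 0.5, device_index: int = 0):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples = []
    deadline = time.time() + seconds
    while time.time() < deadline:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent values
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # bytes
        samples.append({"compute_pct": util.gpu,
                        "memory_pct": util.memory,
                        "mem_used_mib": mem.used / 2**20})
        time.sleep(interval)
    pynvml.nvmlShutdown()
    return samples

# Typically run in a side thread or process alongside the training loop.
```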
Metrics
- Accuracy – mean absolute percentage error (MAPE); a short reference definition is sketched after this list.
- Latency – inference time of each estimator.
- Generalization – ability to predict memory for unseen real‑world models (e.g., ResNet‑50, BERT‑large).
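For reference, MAPE is simply the mean relative error expressed in percent; a short generic implementation (not code from the paper) is:

```python
import numpy as np

def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_pred - y_true) / y_true)) * 100.0)

# Example: true vs. predicted peak memory (MiB) for three models -> ~3.9 %.
print(mape([4096, 10240, 20480], [4300, 9800, 21000]))
```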
Results & Findings
| Estimator | Avg. Memory Error | Avg. Latency (ms) | Cross‑GPU Generalization |
|---|---|---|---|
| Horus (Analytical) | 4.2 % | 0.8 | Poor – needs per‑GPU calibration tables |
| PyTorch FakeTensor (CPU‑side) | 3.1 % | 12 | Moderate – works across GPUs but adds noticeable overhead |
| ML‑based (MLP) | 5.8 % | 1.2 | Good on same‑generation GPUs, degrades to ~9 % on newer hardware |
| ML‑based (Transformer) | 5.4 % | 1.5 | Similar trend as MLP predictor |
Key Observations
Memory estimation
- Analytical models are the most accurate when the hardware profile is known, but they break down on newer GPUs without re‑tuning.
- CPU‑side profiling offers the best raw accuracy but incurs a 10–15× slowdown compared to analytical or ML approaches.
- Lightweight ML predictors achieve sub‑6 % error with negligible latency, making them attractive for runtime schedulers, yet they still suffer when the target GPU architecture diverges from the training data.
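As a toy illustration of the ML‑predictor family just described (the authors' exact architecture and feature set are not reproduced here), a small MLP regressor over model‑metadata features might look like:

```python
# Illustrative only: a small MLP that maps model-metadata features
# (layer counts, parameter count, batch size, input shape, ...) to a
# predicted peak-memory value. Not the paper's exact predictor.
import torch
import torch.nn as nn

class MemoryPredictor(nn.Module):
    def __init__(self, num_features: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),  # predicted peak memory, e.g. in MiB
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

predictor = MemoryPredictor()
features = torch.randn(16, 8)      # a batch of 16 metadata vectors
est_memory = predictor(features)   # shape [16, 1]
```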
Utilization prediction
- Utilization is highly non‑additive; a model that fully occupies compute units may still leave memory bandwidth under‑utilized.
- The ML estimator captured coarse trends (high‑compute vs. high‑memory models) but could not reliably predict fine‑grained spikes caused by kernel launches.
Real‑world validation
- When tested on popular benchmarks (ResNet‑50, GPT‑2, YOLOv5), the ML predictor’s error rose to ~8 %, confirming the “cross‑architecture” limitation highlighted by the authors.
Practical Implications
- Scheduler designers – Adopt the ML‑based estimator as a fast “first‑pass” filter to reject obviously unsafe collocations, then fall back to a more precise CPU‑side probe for borderline cases (a minimal sketch of this two‑stage check follows this list).
- Cloud providers – Integrate per‑GPU calibration tables into analytical models to retain high accuracy without the latency penalty of FakeTensor.
- Developers – Use the released synthetic dataset to fine‑tune a custom predictor for their specific hardware fleet, dramatically reducing OOM incidents when launching multi‑tenant training jobs.
- Tooling – Wrap the open‑source code into CI pipelines to automatically flag models that will exceed a given GPU’s memory budget before they ever hit the cluster.
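A minimal sketch of the two‑stage admission check suggested for scheduler designers above; the estimator callables, margins, and thresholds are illustrative assumptions, not APIs from the paper:

```python
# Hedged sketch of a two-stage collocation check: a cheap ML estimate filters
# clear-cut cases, and only borderline jobs pay for a precise CPU-side probe.
from typing import Any, Callable

def can_collocate(job: Any,
                  gpu_free_bytes: int,
                  fast_estimate: Callable[[Any], float],     # e.g. ML-based predictor
                  precise_estimate: Callable[[Any], float],  # e.g. FakeTensor-style probe
                  safety_margin: float = 0.10,
                  borderline_band: float = 0.15) -> bool:
    budget = gpu_free_bytes * (1.0 - safety_margin)
    fast = fast_estimate(job)
    if fast > budget * (1.0 + borderline_band):
        return False   # clearly unsafe: reject without the expensive probe
    if fast < budget * (1.0 - borderline_band):
        return True    # clearly safe: admit immediately
    # Borderline: fall back to the slower, more accurate estimator.
    return precise_estimate(job) <= budget
```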
Limitations & Future Work
Limitations
- Hardware dependence – All three approaches require at least some knowledge of the target GPU’s memory‑allocation granularity; the ML models struggle to extrapolate to future architectures without retraining.
- Utilization modeling – The study treats utilization as a scalar metric, ignoring temporal dynamics (e.g., bursty kernel execution) that affect contention.
- Dataset scope – Although diverse, the synthetic corpus does not cover exotic operators (e.g., custom CUDA kernels) that appear in production pipelines.
Future Directions
The authors suggest the following avenues for further research:
- Runtime telemetry integration – Continuously adapt ML predictors using live performance data.
- Expanded model set – Incorporate real‑world custom layers and operators.
- Hybrid estimators – Combine analytical insights with learned corrections to improve accuracy and robustness.
Authors
- Ehsan Yousefzadeh‑Asl‑Miandoab
- Reza Karimzadeh
- Danyal Yorulmaz
- Bulat Ibragimov
- Pınar Tözün
Paper Information
- arXiv ID: 2602.17817v1
- Categories: cs.DC
- Published: February 19, 2026