[Paper] GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations
Source: arXiv:2602.17817v1
Overview
The paper investigates how to reliably predict GPU memory consumption and utilization when training deep‑learning models—a prerequisite for safely collocating multiple training jobs on the same GPU. By systematically comparing three families of estimators—analytical formulas, CPU‑side profiling tools, and lightweight machine‑learning models—the authors expose the trade‑offs developers face when trying to maximize GPU throughput without triggering out‑of‑memory (OOM) crashes.
Key Contributions
- Comprehensive benchmark of representative memory estimators (Horus, PyTorch FakeTensor, and a custom ML‑based predictor) across a synthetic suite of MLPs, CNNs, and Transformers.
- Cross‑hardware analysis showing how identical models can have dramatically different memory footprints on different GPU generations.
- First systematic study of GPU‑utilization estimation for training workloads, highlighting why utilization is not a simple additive metric.
- Open‑source artifacts: the synthetic model dataset, training scripts for the ML estimators, and evaluation pipelines released to the community.
- Practical guidelines for choosing an estimator based on accuracy, latency, and integration effort.
Methodology
Synthetic Model Corpus
- Generated a large, controlled set of neural networks varying in depth, width, layer types, and hyper‑parameters; a minimal generator sketch follows this list.
- Covered three architecture families:
  - Multi‑layer Perceptrons (MLPs)
  - Convolutional Neural Networks (CNNs)
  - Transformer‑style models
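The corpus generator itself is not detailed in this summary. As a minimal sketch, assuming a PyTorch‑based generator with illustrative width and depth ranges (all names and defaults below are assumptions, not the paper's code), a random MLP variant could be produced like this:

```python
# Minimal sketch of a synthetic-corpus generator (illustrative only; the
# function name, width/depth ranges, and defaults are assumptions, not the
# paper's actual generator).
import random
import torch.nn as nn

def random_mlp(max_depth: int = 8, in_features: int = 512, num_classes: int = 10) -> nn.Sequential:
    """Build an MLP with randomly sampled depth and per-layer widths."""
    depth = random.randint(1, max_depth)
    widths = [random.choice([64, 128, 256, 512, 1024, 2048, 4096]) for _ in range(depth)]
    layers, prev = [], in_features
    for w in widths:
        layers += [nn.Linear(prev, w), nn.ReLU()]
        prev = w
    layers.append(nn.Linear(prev, num_classes))
    return nn.Sequential(*layers)

# A corpus is then just many such models plus their metadata
# (depth, widths, parameter counts), which later feeds the ML estimators.
corpus = [random_mlp() for _ in range(100)]
```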
Estimator Selection
- Evaluated three representative approaches:
  - Analytical – Horus: computes memory from a static graph and known per‑operator costs (a simplified, back‑of‑the‑envelope variant of this idea is sketched after this list).
  - CPU‑side profiling – PyTorch FakeTensor: runs a lightweight forward pass on the host to infer memory needs.
  - ML‑based – small neural networks (both MLP and Transformer back‑ends) trained on the synthetic corpus to predict memory from model metadata (layer counts, tensor shapes, etc.).
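To make the analytical family concrete, the sketch below gives a back‑of‑the‑envelope training‑memory estimate for an fp32 model trained with Adam. It is a simplified illustration, not the Horus formula; in particular, the activation term is a crude caller‑supplied assumption.

```python
# Simplified analytical estimate (NOT Horus): sum weights, gradients, Adam
# optimizer states, and a caller-supplied activation term, all in bytes.
import torch.nn as nn

def analytical_memory_bytes(model: nn.Module, batch_size: int,
                            activation_elems_per_sample: int,
                            bytes_per_elem: int = 4) -> int:
    n_params = sum(p.numel() for p in model.parameters())
    weights = n_params * bytes_per_elem           # fp32 parameters
    grads = n_params * bytes_per_elem             # one gradient per parameter
    adam_states = 2 * n_params * bytes_per_elem   # exp_avg + exp_avg_sq
    activations = batch_size * activation_elems_per_sample * bytes_per_elem
    return weights + grads + adam_states + activations

model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 10))
# The activation term is itself an estimate; here we crudely use the widest
# hidden layer as elements-per-sample.
print(analytical_memory_bytes(model, batch_size=64, activation_elems_per_sample=2048))
```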
Cross‑GPU Experiments
- Ran each estimator on three GPU architectures (e.g., NVIDIA V100, A100, and RTX 4090) to capture hardware‑specific variance.
Utilization Measurement
- Collected GPU‑utilization traces (SM occupancy, memory bandwidth, and compute utilization) for the same models; a minimal polling sketch follows this list.
- Attempted to predict these utilization metrics using the same ML framework.
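The paper's exact tracing tooling is not named in this summary. One common option is polling NVML via the pynvml package, sketched below; note that NVML's utilization percentages are coarser than true SM occupancy or bandwidth counters.

```python
# Hedged sketch: poll coarse GPU utilization while a training job runs.
# pynvml's percentages are coarser than real SM-occupancy / bandwidth counters.
import time
import pynvml

def poll_utilization(seconds: float = 10.0, interval: float = 0.5, device_index: int = 0):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples = []
    deadline = time.time() + seconds
    while time.time() < deadline:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent values
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # bytes
        samples.append({"compute_pct": util.gpu,
                        "memory_pct": util.memory,
                        "mem_used_mib": mem.used / 2**20})
        time.sleep(interval)
    pynvml.nvmlShutdown()
    return samples

# Typically run in a side thread or process alongside the training loop.
```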
Metrics
- Accuracy – mean absolute percentage error (MAPE); a short reference definition is sketched after this list.
- Latency – inference time of each estimator.
- Generalization – ability to predict memory for unseen real‑world models (e.g., ResNet‑50, BERT‑large).
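For reference, MAPE is simply the mean relative error expressed in percent; a short generic implementation (not code from the paper) is:

```python
import numpy as np

def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_pred - y_true) / y_true)) * 100.0)

# Example: true vs. predicted peak memory (MiB) for three models -> ~3.9 %.
print(mape([4096, 10240, 20480], [4300, 9800, 21000]))
```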
Results & Findings
| Estimator | Avg. Memory Error | Avg. Latency (ms) | Cross‑GPU Generalization |
|---|---|---|---|
| Horus (Analytical) | 4.2 % | 0.8 | Poor – needs per‑GPU calibration tables |
| PyTorch FakeTensor (CPU‑side) | 3.1 % | 12 | Moderate – works across GPUs but adds noticeable overhead |
| ML‑based (MLP) | 5.8 % | 1.2 | Good on same‑generation GPUs, degrades to ~9 % on newer hardware |
| ML‑based (Transformer) | 5.4 % | 1.5 | Similar trend as MLP predictor |
Key Observations
Memory estimation
- Analytical models are the most accurate when the hardware profile is known, but they break down on newer GPUs without re‑tuning.
- CPU‑side profiling offers the best raw accuracy but incurs a 10–15× slowdown compared to analytical or ML approaches.
- Lightweight ML predictors achieve sub‑6 % error with negligible latency, making them attractive for runtime schedulers, yet they still suffer when the target GPU architecture diverges from the training data.
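As a toy illustration of the ML‑predictor family just described (the authors' exact architecture and feature set are not reproduced here), a small MLP regressor over model‑metadata features might look like:

```python
# Illustrative only: a small MLP that maps model-metadata features
# (layer counts, parameter count, batch size, input shape, ...) to a
# predicted peak-memory value. Not the paper's exact predictor.
import torch
import torch.nn as nn

class MemoryPredictor(nn.Module):
    def __init__(self, num_features: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),  # predicted peak memory, e.g. in MiB
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

predictor = MemoryPredictor()
features = torch.randn(16, 8)      # a batch of 16 metadata vectors
est_memory = predictor(features)   # shape [16, 1]
```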
Utilization prediction
- Utilization is highly non‑additive; a model that fully occupies compute units may still leave memory bandwidth under‑utilized.
- The ML estimator captured coarse trends (high‑compute vs. high‑memory models) but could not reliably predict fine‑grained spikes caused by kernel launches.
Real‑world validation
- When tested on popular benchmarks (ResNet‑50, GPT‑2, YOLOv5), the ML predictor’s error rose to ~8 %, confirming the “cross‑architecture” limitation highlighted by the authors.
Practical Implications
- Scheduler designers – Adopt the ML‑based estimator as a fast “first‑pass” filter to reject obviously unsafe collocations, then fall back to a more precise CPU‑side probe for borderline cases (a minimal sketch of this two‑stage check follows this list).
- Cloud providers – Integrate per‑GPU calibration tables into analytical models to retain high accuracy without the latency penalty of FakeTensor.
- Developers – Use the released synthetic dataset to fine‑tune a custom predictor for their specific hardware fleet, dramatically reducing OOM incidents when launching multi‑tenant training jobs.
- Tooling – Wrap the open‑source code into CI pipelines to automatically flag models that will exceed a given GPU’s memory budget before they ever hit the cluster.
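A minimal sketch of the two‑stage admission check suggested for scheduler designers above; the estimator callables, margins, and thresholds are illustrative assumptions, not APIs from the paper:

```python
# Hedged sketch of a two-stage collocation check: a cheap ML estimate filters
# clear-cut cases, and only borderline jobs pay for a precise CPU-side probe.
from typing import Any, Callable

def can_collocate(job: Any,
                  gpu_free_bytes: int,
                  fast_estimate: Callable[[Any], float],     # e.g. ML-based predictor
                  precise_estimate: Callable[[Any], float],  # e.g. FakeTensor-style probe
                  safety_margin: float = 0.10,
                  borderline_band: float = 0.15) -> bool:
    budget = gpu_free_bytes * (1.0 - safety_margin)
    fast = fast_estimate(job)
    if fast > budget * (1.0 + borderline_band):
        return False   # clearly unsafe: reject without the expensive probe
    if fast < budget * (1.0 - borderline_band):
        return True    # clearly safe: admit immediately
    # Borderline: fall back to the slower, more accurate estimator.
    return precise_estimate(job) <= budget
```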
Limitations & Future Work
Limitations
- Hardware dependence – All three approaches require at least some knowledge of the target GPU’s memory‑allocation granularity; the ML models struggle to extrapolate to future architectures without retraining.
- Utilization modeling – The study treats utilization as a scalar metric, ignoring temporal dynamics (e.g., bursty kernel execution) that affect contention.
- Dataset scope – Although diverse, the synthetic corpus does not cover exotic operators (e.g., custom CUDA kernels) that appear in production pipelines.
Future Directions
The authors suggest the following avenues for further research:
- Runtime telemetry integration – Continuously adapt ML predictors using live performance data.
- Expanded model set – Incorporate real‑world custom layers and operators.
- Hybrid estimators – Combine analytical insights with learned corrections to improve accuracy and robustness.
Authors
- Ehsan Yousefzadeh‑Asl‑Miandoab
- Reza Karimzadeh
- Danyal Yorulmaz
- Bulat Ibragimov
- Pınar Tözün
Paper Information
- arXiv ID: 2602.17817v1
- Categories: cs.DC
- Published: February 19, 2026