[Paper] Coordinated Cooling and Compute Management for AI Datacenters

Published: January 12, 2026
4 min read
Source: arXiv - 2601.08113v1

Overview

AI datacenters that host large‑language‑model (LLM) inference are among the most power‑hungry facilities on the planet. While researchers have long tried to squeeze more compute out of GPUs, they have largely ignored the heat those GPUs generate and the cooling systems needed to keep them safe. This paper bridges that gap by jointly modeling compute scheduling and thermal management, then using the model to drive a hierarchical controller that cuts energy use without hurting latency.

Key Contributions

  • Empirical profiling of GPU servers under a wide range of AI workloads and cooling set‑points, exposing the tight coupling between GPU frequency, parallelism, and heat generation.
  • Joint compute‑thermal model that captures both the performance dynamics of LLM inference (parallelism, DVFS) and the thermodynamic response of the datacenter cooling infrastructure.
  • Hierarchical control framework that simultaneously selects optimal GPU parallelism, dynamic voltage‑frequency scaling (DVFS) levels, and cooling actuator settings (e.g., fan speed, chilled‑water flow); a schematic formulation of the underlying optimization appears after this list.
  • Real‑world validation using Azure inference traces and detailed GPU telemetry, showing measurable energy savings while respecting latency Service Level Objectives (SLOs).
  • Open‑source artifact (simulation scripts and model parameters) to enable reproducibility and further research on compute‑thermal co‑optimization.
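
For concreteness, the joint problem can be written schematically as follows. This is a simplified formulation consistent with the description in this post, not the paper's exact notation: p is per‑server parallelism, f the DVFS frequency, and u the cooling actuator settings; E denotes energy, L_99 tail latency, and T_rack the rack temperature.

```latex
\begin{aligned}
\min_{p,\,f,\,u} \quad & E_{\text{compute}}(p, f) + E_{\text{cooling}}(u) \\
\text{s.t.} \quad & L_{99}(p, f) \le L_{\text{SLO}}, \\
& T_{\text{rack}}\big(P_{\text{GPU}}(p, f),\, u\big) \le T_{\text{max}}, \\
& p \in \mathcal{P}, \quad f \in \mathcal{F}, \quad u \in \mathcal{U}.
\end{aligned}
```

The hierarchical controller described in the next section can be read as an approximate online solver for this problem, splitting the decision between per‑server compute knobs and datacenter‑wide cooling knobs.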

Methodology

  1. Workload Characterization – The authors collected fine‑grained metrics (GPU utilization, temperature, power draw) from Azure GPU servers running popular LLM inference workloads (e.g., GPT‑2, BERT). They varied two knobs: the number of parallel inference requests (parallelism) and the GPU frequency (via DVFS).
  2. Thermal Modeling – Using the collected data, they built a physics‑inspired model that predicts rack‑level temperature as a function of total GPU power, airflow, and cooling system set‑points. The model is lightweight enough for online control.
  3. Joint Optimization Problem – They formulated a constrained optimization that minimizes total energy (compute + cooling) while keeping request latency below a target SLO. Decision variables are:
    • Parallelism (how many requests each GPU handles concurrently)
    • DVFS frequency (GPU clock speed)
    • Cooling control (fan speed, chilled‑water flow)
  4. Hierarchical Controller – A two‑level controller runs every few seconds (a minimal illustrative sketch appears after this list):
    • Local layer on each server picks parallelism/DVFS based on current queue length and temperature.
    • Global layer (datacenter‑wide) adjusts cooling set‑points to keep rack temperatures within safe bounds.
  5. Evaluation – The controller was deployed in a trace‑driven simulator fed with real Azure inference logs. Energy consumption, latency, and temperature were compared against baseline policies that only tune compute (no thermal awareness) or only tune cooling (static compute).
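
The sketch referenced in step 4 is below: a minimal illustrative Python version of the two‑level loop. Every model, constant, and function name here is a hypothetical placeholder (a toy power/latency model, a first‑order thermal update, proportional cooling control) chosen only to show the control structure; the paper fits its own models to Azure telemetry.

```python
from dataclasses import dataclass

@dataclass
class ServerState:
    queue_len: int      # pending inference requests
    temp_c: float       # current GPU/rack temperature (°C)

def gpu_power(parallelism: int, freq_ghz: float) -> float:
    """Toy compute-power model: power grows with parallelism and roughly cubically with frequency."""
    return 60.0 + 25.0 * parallelism * freq_ghz ** 3

def latency_ms(parallelism: int, freq_ghz: float, queue_len: int) -> float:
    """Toy latency model: queueing delay shrinks as throughput (parallelism * frequency) grows."""
    throughput = parallelism * freq_ghz * 4.0
    return 20.0 + 100.0 * queue_len / max(throughput, 1e-6)

def next_rack_temp(temp_c: float, total_power_w: float, fan_level: float, dt_s: float = 5.0) -> float:
    """First-order thermal update: heating from IT power, cooling proportional to fan effort."""
    heating = 0.0004 * total_power_w
    cooling = 0.08 * fan_level * (temp_c - 18.0)    # assumed 18 °C supply-air temperature
    return temp_c + dt_s * (heating - cooling)

def local_control(state: ServerState, latency_slo_ms: float, temp_soft_limit_c: float):
    """Local layer: each server picks the lowest-power (parallelism, DVFS) pair that meets the SLO."""
    candidates = [(p, f) for p in (1, 2, 4, 8) for f in (0.9, 1.2, 1.5, 1.8)]
    best, best_power = None, float("inf")
    for p, f in candidates:
        if latency_ms(p, f, state.queue_len) > latency_slo_ms:
            continue                                # would violate the latency SLO
        if state.temp_c > temp_soft_limit_c and f > 1.2:
            continue                                # back off high frequencies when running hot
        power = gpu_power(p, f)
        if power < best_power:
            best, best_power = (p, f), power
    return best or (8, 1.8)                         # fall back to max performance if nothing qualifies

def global_control(rack_temps, temp_target_c: float = 27.0) -> float:
    """Global layer: proportional cooling effort driven by the hottest rack."""
    error = max(rack_temps) - temp_target_c
    return min(1.0, max(0.2, 0.5 + 0.3 * error))    # normalized fan / chilled-water effort in [0.2, 1.0]

def control_step(servers, rack_temp_c, latency_slo_ms=120.0):
    """One control interval for a single rack: compute settings, cooling effort, next temperature."""
    settings = [local_control(s, latency_slo_ms, temp_soft_limit_c=30.0) for s in servers]
    total_power = sum(gpu_power(p, f) for p, f in settings)
    fan = global_control([rack_temp_c])
    return settings, fan, next_rack_temp(rack_temp_c, total_power, fan)

if __name__ == "__main__":
    rack = [ServerState(queue_len=6, temp_c=29.0), ServerState(queue_len=2, temp_c=27.5)]
    print(control_step(rack, rack_temp_c=28.0))
```

A real deployment would replace the toy models with ones fitted to telemetry, but the division of labor is the point: servers choose parallelism/DVFS against their queues and temperatures, while the facility‑level loop sets cooling effort against rack temperatures.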

Results & Findings

| Metric | Baseline (compute‑only) | Proposed co‑optimization | Improvement |
| --- | --- | --- | --- |
| Total energy (compute + cooling) | 1.00× | 0.78× | 22% reduction |
| 99th‑percentile latency | 120 ms | 115 ms | ~4% lower |
| Average rack temperature | 28 °C | 26 °C | 2 °C drop |
| Cooling power share | 45% of total | 35% of total | 10 percentage‑point drop |
  • The controller kept latency within the SLO (≤ 120 ms) while shaving a fifth off the overall power draw.
  • By modestly lowering GPU frequencies during high‑temperature periods, the system avoided the thermal‑throttling events that would otherwise cause latency spikes.
  • Cooling systems operated at lower fan speeds for most of the day, translating into lower carbon emissions when the electricity mix is not fully renewable.
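
As a rough consistency check on the table above, and assuming "cooling power share" is measured against each scenario's own total, the reported figures imply the split computed below. This is back‑of‑the‑envelope arithmetic, not a breakdown given in the paper:

```python
# Normalize to baseline total energy = 1.00 (assumption: shares are relative to each scenario's own total).
baseline_total, coopt_total = 1.00, 0.78               # 22% total-energy reduction

baseline_cooling = 0.45 * baseline_total               # 45% cooling share -> 0.45
coopt_cooling = 0.35 * coopt_total                     # 35% of the smaller total -> ~0.27

baseline_compute = baseline_total - baseline_cooling   # 0.55
coopt_compute = coopt_total - coopt_cooling            # ~0.51

print(f"cooling energy: {baseline_cooling:.2f} -> {coopt_cooling:.2f}")
print(f"compute energy: {baseline_compute:.2f} -> {coopt_compute:.2f}")
```

Under this reading, most of the saving comes from the cooling side, with a smaller reduction in compute energy from the modestly lowered DVFS settings.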

Practical Implications

  • Datacenter operators can integrate the hierarchical controller into existing workload managers (Kubernetes, Slurm) to automatically balance performance and cooling, extending hardware life and reducing OPEX.
  • GPU‑focused AI services (e.g., inference‑as‑a‑service platforms) gain a new lever—thermal awareness—to meet strict latency SLAs without over‑provisioning hardware.
  • Hardware vendors may expose richer telemetry (per‑core temperature, fan curves) and finer‑grained DVFS APIs to enable tighter compute‑thermal loops.
  • Sustainability reporting benefits from a clearer attribution of energy savings to joint compute‑cooling optimization, helping firms meet ESG targets.
  • The modeling approach is cloud‑agnostic; it can be ported to on‑premise AI clusters, edge AI boxes, or emerging liquid‑cooled GPU farms.

Limitations & Future Work

  • The thermal model assumes steady‑state airflow and does not capture rapid transients caused by sudden workload spikes or cooling system faults.
  • Experiments were trace‑driven rather than run on a live production cluster; real‑world deployment may reveal integration challenges with existing orchestration tools.
  • The study focuses on GPU‑centric inference; extending the framework to heterogeneous accelerators (TPUs, FPGAs) and to training workloads is left for future research.
  • Future work could explore reinforcement‑learning‑based controllers that adapt to changing ambient conditions and electricity pricing, as well as multi‑objective optimization that jointly minimizes energy, latency, and carbon emissions.

Authors

  • Nardos Belay Abera
  • Yize Chen

Paper Information

  • arXiv ID: 2601.08113v1
  • Categories: eess.SY, cs.DC
  • Published: January 13, 2026