GPU D3cold Power States: How to Brick Your Card Without Trying

Published: April 24, 2026 at 02:15 PM EDT
2 min read
Source: Dev.to

Symptom

My NVIDIA Tesla P40 stopped responding after a VM shutdown. No error messages were shown; the GPU remained dead until the host was fully rebooted.

Expected behavior

A clean shutdown of a VM with GPU passthrough should leave the GPU in a ready state. The host is expected to handle power‑state transitions gracefully.

What actually happened

The GPU entered D3cold, a low‑power state it could not exit without a full host reboot. This occurred even after proper VM shutdowns and was especially prevalent on Proxmox 8.4 with kernel 6.8.x and QEMU 8.0.1, where the lack of FLR (Function Level Reset) support on the P40 prevented the host from resetting the GPU.
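Whether a device actually supports FLR is visible in its PCIe capability block (`FLReset+` vs `FLReset-` in the `DevCap:` line of `lspci -vvv`), and the current power state can be read from sysfs. A minimal diagnostic sketch; the `0000:08:00.0` address and the `flr_supported` helper are illustrative, not from the original post:

```shell
# Prints "yes" if the DevCap line passed on stdin advertises FLReset+.
flr_supported() {
    grep -q 'FLReset+' && echo yes || echo no
}

# Usage on a live host (requires root and pciutils):
#   lspci -vvv -s 0000:08:00.0 | grep 'DevCap:' | flr_supported
#   cat /sys/bus/pci/devices/0000:08:00.0/power_state
```

On a P40 the first command reports "no", which is exactly why the host cannot reset the card once it falls into D3cold.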

Fix

  1. Disable D3cold before passing the device through:

    echo 0 > /sys/bus/pci/devices/0000:08:00.0/d3cold_allowed
  2. Tag the GPU with a udev rule so host scripts can locate it by vendor/device ID even if it enumerates at a different bus address after a reboot (0x1b38 is the Tesla P40 device ID; PCI devices have no device node, so a tag is used rather than a symlink):

    ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{device}=="0x1b38", TAG+="gpu-passthrough"
  3. For Proxmox 8.4 users, explicitly set the Q35 machine type in the VM configuration to avoid QEMU assertions:

    qm set <vmid> --machine q35

These steps keep the GPU on the same PCIe bus and prevent it from entering the D3cold trap.
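Note that the `echo 0 > …/d3cold_allowed` write in step 1 does not survive a reboot. One way to persist it is a one-shot systemd unit that runs at boot; the unit name, ordering target, and PCI address below are assumptions for illustration, not part of the original fix:

```shell
# Sketch: a systemd unit that disables D3cold for the passthrough GPU
# at every boot. Adjust the PCI address and ordering to your setup.
unit_content='[Unit]
Description=Disable D3cold for passthrough GPU
Before=libvirtd.service

[Service]
Type=oneshot
ExecStart=/bin/sh -c "echo 0 > /sys/bus/pci/devices/0000:08:00.0/d3cold_allowed"

[Install]
WantedBy=multi-user.target'

# To install (as root):
#   printf '%s\n' "$unit_content" > /etc/systemd/system/gpu-d3cold.service
#   systemctl daemon-reload && systemctl enable --now gpu-d3cold.service
```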

Why this matters

Non‑FLR GPUs such as the P40 (and similar models like the T4) on Proxmox 8.4 or later are prone to this issue. Without disabling D3cold and pinning the PCI address, the GPU can become permanently unresponsive, effectively “bricked,” until the host is power‑cycled.

The problem is not limited to Proxmox. Any system that lacks FLR support on the GPU and relies on the kernel to manage power states is at risk. The same symptoms have been observed with AMD GPUs under certain conditions, though the mitigation steps differ.
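Where the kernel manages power states for the device, runtime power management can also be forced off so the kernel never autosuspends the card between uses. A hedged sketch using the standard `power/control` sysfs attribute (the helper name and address are illustrative):

```shell
# Pin a PCI device's runtime PM to "on" so the kernel never
# autosuspends it. Fails cleanly if the address does not exist.
disable_runtime_pm() {
    dev="/sys/bus/pci/devices/$1"
    [ -d "$dev" ] || { echo "no such device: $1" >&2; return 1; }
    echo on > "$dev/power/control"
}

# Usage (as root):
#   disable_runtime_pm 0000:08:00.0
```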

Additional considerations

  • Running the NVIDIA driver on the host (instead of passthrough) can provide a more stable environment, avoiding PCIe bus instability and power‑state issues. The NVIDIA Container Toolkit works well in production for this approach.

  • For AI workloads on Kubernetes or other orchestration platforms that depend on GPU passthrough, ensuring the GPU never enters an unrecoverable power state is critical to avoid node‑wide power cycles.
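For the host-driver approach above, the Container Toolkit exposes GPUs to containers through Docker's `--gpus` flag, so no passthrough (and no guest-owned power-state transitions) is involved. A small sketch that assembles the command; the image tag and helper name are examples, not from the post:

```shell
# Builds a docker invocation that runs nvidia-smi inside a
# GPU-enabled container (requires the NVIDIA Container Toolkit).
gpu_run_cmd() {
    echo "docker run --rm --gpus all $1 nvidia-smi"
}

# Typical invocation:
#   $(gpu_run_cmd nvidia/cuda:12.4.0-base-ubuntu22.04)
```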

Takeaway

Disabling D3cold and pinning the GPU’s PCI identity are essential steps when using non‑FLR GPUs with passthrough. Applied together, these fixes should keep the GPU responsive across VM shutdowns and reboots.
