Tutorial: Build an AI-Powered GPU Fleet Optimizer

Published: April 17, 2026 at 03:00 PM EDT
7 min read
Source: Dev.to

Introduction

Deploy a serverless LangGraph agent on the DigitalOcean Gradient AI Platform that monitors your GPU fleet using natural‑language queries. The agent scrapes real‑time NVIDIA DCGM metrics (temperature, power, VRAM, engine utilization) from GPU Droplets via Prometheus‑style endpoints on port 9400, detects idle and under‑utilized GPUs, and can trigger actions such as automated power‑off commands. This reduces cloud costs by replacing reactive dashboard monitoring with a proactive AI assistant.

Why GPU Fleet Management Is Hard

  • Cost impact: A single idle GPU Droplet left running overnight can add hundreds of dollars to your monthly bill.
  • Traditional dashboards: Show raw metrics but still require a human to interpret whether a machine is “working” or “wasting money.”

The tutorial walks you through building an AI‑powered GPU fleet optimizer using the DigitalOcean Gradient AI Platform and the Agent Development Kit (ADK). By the end you will be able to:

  • Deploy a serverless, natural‑language AI agent that audits GPU infrastructure in real time.
  • Scrape NVIDIA DCGM metrics (temperature, power draw, VRAM usage, engine utilization).
  • Flag idle resources before they inflate your cloud bill.
  • Fork and customize the blueprint (adjust thresholds, add tools, change the agent’s persona).

Prerequisites

  • DigitalOcean account with at least one active GPU Droplet.
  • DigitalOcean API token (Personal Access Token with read permissions and GenAI scopes).
  • Gradient Model Access Key (generated from the Gradient AI Dashboard).
  • Python 3.12 (recommended for the latest LangGraph and asyncio features).
  • Familiarity with Python, REST APIs, and Linux command‑line basics.

NVIDIA DCGM Metrics

NVIDIA Data Center GPU Manager (DCGM) exposes hardware telemetry through a Prometheus‑compatible exporter on port 9400.

  • DCGM_FI_DEV_GPU_TEMP: GPU die temperature in Celsius. High temperatures indicate active computation.
  • DCGM_FI_DEV_POWER_USAGE: Current power draw in watts. Idle GPUs draw significantly less power.
  • DCGM_FI_DEV_FB_USED: Framebuffer (VRAM) memory in use. Empty VRAM means no models are loaded.
  • DCGM_FI_DEV_GPU_UTIL: GPU engine utilization percentage. A direct indicator of compute work.

You can query these metrics directly:

curl -s http://<droplet_ip>:9400/metrics \
  | grep -E "DCGM_FI_DEV_GPU_TEMP|DCGM_FI_DEV_POWER_USAGE|DCGM_FI_DEV_FB_USED|DCGM_FI_DEV_GPU_UTIL"

If DCGM is unavailable on a node, the agent falls back to standard CPU/RAM metrics and reports “DCGM Missing”.
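The DCGM exporter returns plain Prometheus text, so extracting these gauges takes only a few lines. A minimal sketch (the parse_dcgm_metrics helper and the sample payload are illustrative, not taken from the repository):

```python
# Pull the four DCGM gauges out of a raw /metrics response.
DCGM_KEYS = (
    "DCGM_FI_DEV_GPU_TEMP",
    "DCGM_FI_DEV_POWER_USAGE",
    "DCGM_FI_DEV_FB_USED",
    "DCGM_FI_DEV_GPU_UTIL",
)

def parse_dcgm_metrics(text: str) -> dict:
    """Return {metric_name: float} for the DCGM gauges we care about."""
    values = {}
    for line in text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        for key in DCGM_KEYS:
            if line.startswith(key):
                # Prometheus exposition format: name{labels} value
                values[key] = float(line.rsplit(" ", 1)[-1])
    return values

sample = (
    'DCGM_FI_DEV_GPU_TEMP{gpu="0"} 34\n'
    'DCGM_FI_DEV_POWER_USAGE{gpu="0"} 41.5\n'
)
```

Calling parse_dcgm_metrics(sample) yields a float per gauge; an empty result is a simple signal that DCGM is missing on that node.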

Repository Setup

Start with the foundational repository rather than writing everything from scratch.

git clone https://github.com/dosraashid/do-adk-gpu-monitor
cd do-adk-gpu-monitor
python3.12 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Configure Secrets

Create a .env file in the project root:

DIGITALOCEAN_API_TOKEN="your_do_token"
GRADIENT_MODEL_ACCESS_KEY="your_gradient_key"

Security note: Never commit .env files to version control; the repository’s .gitignore already excludes this file.
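The repository most likely loads these variables with a library such as python-dotenv; as a stand-alone illustration of what that loading does, here is a minimal stdlib-only sketch (load_env_file is hypothetical, not the repo's code):

```python
import os
import tempfile

def load_env_file(path: str) -> None:
    """Minimal .env loader: KEY="value" lines go into os.environ (no overwrite)."""
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip('"'))

# Demo with a throwaway file so nothing real is touched.
demo = os.path.join(tempfile.mkdtemp(), ".env")
with open(demo, "w") as fh:
    fh.write('DIGITALOCEAN_API_TOKEN="your_do_token"\n')
load_env_file(demo)
```

After loading, code anywhere in the project can read the secrets with os.getenv("DIGITALOCEAN_API_TOKEN") instead of hard-coding them.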

Data Flow Overview

  1. User Prompt – Sent to the /run endpoint.
  2. LangGraph State – Conversation memory (thread_id) is managed by MemorySaver for multi‑turn interactions.
  3. Tool Execution – The LLM decides to call @tool functions such as analyze_gpu_fleet().
  4. Parallel Scraping – analyzer.py uses ThreadPoolExecutor to query the DigitalOcean API and each Droplet’s DCGM endpoint concurrently.
  5. Omniscient Payload – All raw data (temperature, power, VRAM, RAM, CPU, cost) are packaged into a structured JSON dictionary for the LLM.
  6. Synthesis – The LLM reads the JSON and responds in natural language with node names, costs, and recommendations.
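Step 4 is the part most worth understanding: one thread per Droplet keeps the total scan time close to the slowest single endpoint. A sketch of that fan-out, with fetch_node_metrics as an offline stand-in for the real scraper in analyzer.py:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_node_metrics(droplet: dict) -> dict:
    # The real scraper would GET http://<ip>:9400/metrics with a short timeout;
    # this stand-in just echoes the droplet so the flow is testable offline.
    return {"droplet_id": droplet["id"], "name": droplet["name"], "dcgm": None}

def scan_fleet(droplets: list) -> list:
    # pool.map preserves input order, so results line up with the droplet list.
    with ThreadPoolExecutor(max_workers=10) as pool:
        return list(pool.map(fetch_node_metrics, droplets))
```

The list scan_fleet returns is what gets serialized into the "omniscient payload" handed to the LLM in step 5.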

For more on stateful AI agents with LangGraph, see the Getting Started with Agentic AI Using LangGraph tutorial.

Customizing the Blueprint

Agent Persona (config.py)

Edit AGENT_SYSTEM_PROMPT to change how the AI communicates.

  • Remove emojis and request raw bullet points for a technical DevOps assistant.
  • Use a management‑focused tone for cost‑summary reports.

Thresholds

The default idle detection uses the following dictionary (shown as a code block for easy copy‑paste):

THRESHOLDS = {
    "gpu": {
        "max_temp_c": 82.0,
        "max_util_percent": 95.0,
        "max_vram_percent": 95.0,

        "idle_util_percent": 2.0,
        "idle_vram_percent": 5.0,

        "optimized_util_percent": 40.0,
        "optimized_vram_percent": 50.0,
    },
    "system": {
        "idle_cpu_percent": 3.0,
        "idle_ram_percent": 15.0,
        "idle_load_15": 0.5,

        "starved_cpu_percent": 85.0,
        "starved_ram_percent": 90.0,

        "optimized_cpu_percent": 40.0,
        "optimized_ram_percent": 50.0,
    },
}

Adjust values to match your workload baseline (e.g., set idle_util_percent to 10.0 if your inference servers typically idle at 8% GPU utilization).

Droplet Filtering

By default only Droplets with "gpu" in the size_slug are scanned:

target_droplets = [d for d in all_droplets if "gpu" in d.get("size_slug", "").lower()]

  • Change "gpu" to "c-" for CPU‑optimized Droplets.
  • Remove the filter entirely to scan all Droplets.

Adding New Metrics

If you install Prometheus Node Exporter (port 9100) and want to include disk space:

  1. Update metrics.py to scrape disk metrics.
  2. Extend the return dictionary in analyzer.py:
return {
    "droplet_id": droplet_id,
    "gpu_temp": temp_val,
    "gpu_power": power_val,
    "vram_used": vram_val,
    "disk_space_free_gb": disk_val,  # New metric
}

Write‑Access Tools (e.g., Power‑Off)

Add a new @tool function in main.py:

@tool
def power_off_droplet(droplet_id: str) -> str:
    """Power off a Droplet by ID. Use only when the user explicitly asks to stop an idle node."""
    import requests, os

    token = os.getenv("DIGITALOCEAN_API_TOKEN")
    response = requests.post(
        f"https://api.digitalocean.com/v2/droplets/{droplet_id}/actions",
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        json={"type": "power_off"},
    )
    if response.status_code == 201:
        return f"Successfully sent power‑off command to Droplet {droplet_id}."
    return f"Failed to power off Droplet {droplet_id}: {response.status_code} {response.text}"

Bind the new tool to the LLM:

llm_with_tools = llm.bind_tools([analyze_gpu_fleet, power_off_droplet])

Warning: Granting write access requires strict guardrails—confirmation prompts, tag restrictions, and audit logging are recommended.
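One cheap guardrail is an explicit opt-in tag: the tool refuses any Droplet that has not been tagged for automated power-off. A sketch, assuming a hypothetical tag name "agent-managed" (pick your own convention):

```python
AUTOPOWER_TAG = "agent-managed"

def guard_power_off(droplet: dict) -> None:
    """Raise unless the droplet is explicitly opted in to automated power-off."""
    if AUTOPOWER_TAG not in droplet.get("tags", []):
        raise PermissionError(
            f"Droplet {droplet['id']} lacks the '{AUTOPOWER_TAG}' tag; refusing."
        )
```

Calling guard_power_off at the top of power_off_droplet (and logging every call) turns "the LLM decided to" into "the LLM decided to, and the operator had pre-approved it."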

Running Locally

Start the development server:

gradient agent run

In another terminal, simulate requests:

curl -X POST http://localhost:8080/run \
     -H "Content-Type: application/json" \
     -d '{
           "prompt": "Give me a full diagnostic on my GPU nodes including temperature and power.",
           "thread_id": "audit-session-1"
         }'

Follow‑up example (same thread_id):

curl -X POST http://localhost:8080/run \
     -H "Content-Type: application/json" \
     -d '{
           "prompt": "Which of those nodes was the most expensive?",
           "thread_id": "audit-session-1"
         }'

Changing the thread_id starts a fresh conversation, demonstrating scoped memory.
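Conceptually, MemorySaver keys checkpoints by thread_id, so each id gets its own isolated history. A dict-of-histories stand-in (not the LangGraph implementation, just the mental model):

```python
from collections import defaultdict

# One independent conversation history per thread_id.
history: dict[str, list[str]] = defaultdict(list)

def run(prompt: str, thread_id: str) -> list[str]:
    """Append the prompt and return everything the LLM would see for this thread."""
    history[thread_id].append(prompt)
    return history[thread_id]
```

With LangGraph itself, the same scoping is selected at invoke time via config={"configurable": {"thread_id": "audit-session-1"}}.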

Deploying to DigitalOcean Gradient

gradient agent deploy

You’ll receive a public endpoint URL that can be integrated with Slack bots, internal dashboards, CI/CD pipelines, or any HTTP client. The Gradient platform handles scaling for concurrent users.

Comparison: Traditional Dashboards vs. AI Agent

  • Setup complexity: dashboards are moderate (Prometheus server, Grafana, DCGM exporter); the agent is low (clone repo, set env vars, deploy).
  • Real‑time alerting: dashboards use rule‑based alerts with fixed thresholds; the agent answers natural‑language queries with adaptive reasoning.
  • Multi‑metric correlation: dashboards require manual visual comparison; the agent has the LLM correlate temperature, power, VRAM, and cost automatically.
  • Actionability: dashboards are read‑only and need separate automation; the agent is extensible via @tool for direct API actions.
  • Conversational follow‑ups: not supported by dashboards; built into the agent via LangGraph MemorySaver and thread_id scoping.
  • Best for: dashboards suit large teams with dedicated SRE/DevOps staff and historical trend analysis; the agent suits small‑to‑mid teams needing fast, conversational GPU auditing without a full monitoring stack.

For fleets under ~20 GPU Droplets, the AI agent eliminates the overhead of a full monitoring stack while still delivering actionable insights. Larger fleets may benefit from running both solutions.

Architectural Considerations

  • Contextual intelligence: MemorySaver provides conversation history, enabling drill‑down questions without re‑scanning the fleet.
  • Parallel processing: ThreadPoolExecutor scans dozens of Droplets concurrently, preventing LLM timeouts.
  • Cost justification: A single idle $500/month GPU instance saved justifies the agent’s inference cost.
  • Graceful degradation: If port 9400 is unreachable, the agent reports “DCGM Missing” and falls back to CPU/RAM metrics.
  • Security: Use read‑only API tokens unless write tools are added; scope permissions carefully and implement audit logging.

Benefits & Use Cases

  1. Catch forgotten resources: Identify GPU Droplets that remain running after experiments or training jobs finish.
  2. Noise reduction: Directly query GPU engine and VRAM utilization, bypassing misleading low‑CPU metrics.
  3. Unified workflow: One natural‑language query replaces multiple UI interactions (DigitalOcean Control Panel, Grafana, architecture diagrams).
  4. Extensible automation: Add tools to power off, resize, or scale resources directly from the conversational interface.

Next Steps & Resources

  • DigitalOcean Gradient AI Platform Documentation – Full reference for deploying and managing AI agents.
  • How to Build Agents Using ADK – Step‑by‑step guide for custom agents.
  • Getting Started with Agentic AI Using LangGraph – Fundamentals of stateful, multi‑step AI agents.
  • Stable Diffusion on DigitalOcean GPU Droplets – Example of GPU‑accelerated AI workloads.
  • Scaling Gradient with GPU Droplets and Networking – Architect production GenAI deployments with GPU Droplets, load balancers, and VPC networking.