Tutorial: Build an AI-Powered GPU Fleet Optimizer
Source: Dev.to
Introduction
Deploy a serverless LangGraph agent on the DigitalOcean Gradient AI Platform that monitors your GPU fleet using natural‑language queries. The agent scrapes real‑time NVIDIA DCGM metrics (temperature, power, VRAM, engine utilization) from GPU Droplets via Prometheus‑style endpoints on port 9400, detects idle and under‑utilized GPUs, and can trigger actions such as automated power‑off commands. This reduces cloud costs by replacing reactive dashboard monitoring with a proactive AI assistant.
Why GPU Fleet Management Is Hard
- Cost impact: A single idle GPU Droplet left running overnight can add hundreds of dollars to your monthly bill.
- Traditional dashboards: Show raw metrics but still require a human to interpret whether a machine is “working” or “wasting money.”
The tutorial walks you through building an AI‑powered GPU fleet optimizer using the DigitalOcean Gradient AI Platform and the Agent Development Kit (ADK). By the end you will be able to:
- Deploy a serverless, natural‑language AI agent that audits GPU infrastructure in real time.
- Scrape NVIDIA DCGM metrics (temperature, power draw, VRAM usage, engine utilization).
- Flag idle resources before they inflate your cloud bill.
- Fork and customize the blueprint (adjust thresholds, add tools, change the agent’s persona).
Prerequisites
- DigitalOcean account with at least one active GPU Droplet.
- DigitalOcean API token (Personal Access Token with read permissions and GenAI scopes).
- Gradient Model Access Key (generated from the Gradient AI Dashboard).
- Python 3.12 (recommended for the latest LangGraph and asyncio features).
- Familiarity with Python, REST APIs, and Linux command‑line basics.
NVIDIA DCGM Metrics
NVIDIA Data Center GPU Manager (DCGM) exposes hardware telemetry through a Prometheus‑compatible exporter on port 9400.
| Metric | What It Measures | Why It Matters |
|---|---|---|
| DCGM_FI_DEV_GPU_TEMP | GPU die temperature in Celsius | High temperatures indicate active computation. |
| DCGM_FI_DEV_POWER_USAGE | Current power draw in watts | Idle GPUs draw significantly less power. |
| DCGM_FI_DEV_FB_USED | Framebuffer (VRAM) memory in use | Empty VRAM means no models are loaded. |
| DCGM_FI_DEV_GPU_UTIL | GPU engine utilization percentage | Direct indicator of compute work. |
You can query these metrics directly:
curl -s http://<droplet_ip>:9400/metrics \
| grep -E "DCGM_FI_DEV_GPU_TEMP|DCGM_FI_DEV_POWER_USAGE|DCGM_FI_DEV_FB_USED|DCGM_FI_DEV_GPU_UTIL"
If DCGM is unavailable on a node, the agent falls back to standard CPU/RAM metrics and reports “DCGM Missing”.
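As a minimal sketch of how the exporter output might be reduced to the four gauges in the table above — the helper `parse_dcgm_metrics` is illustrative, not part of the repository:

```python
# Parse Prometheus-style text from DCGM's exporter (port 9400) into a dict.
# An empty result would correspond to the "DCGM Missing" fallback case.
DCGM_METRICS = {
    "DCGM_FI_DEV_GPU_TEMP",
    "DCGM_FI_DEV_POWER_USAGE",
    "DCGM_FI_DEV_FB_USED",
    "DCGM_FI_DEV_GPU_UTIL",
}

def parse_dcgm_metrics(raw):
    """Extract the key DCGM gauges from raw exporter output."""
    values = {}
    for line in raw.splitlines():
        if line.startswith("#"):           # skip HELP/TYPE comment lines
            continue
        parts = line.split()
        if len(parts) != 2:
            continue
        name = parts[0].split("{")[0]      # drop Prometheus label sets
        if name in DCGM_METRICS:
            values[name] = float(parts[1])
    return values

sample = """# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (C)
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 31
DCGM_FI_DEV_POWER_USAGE{gpu="0"} 24.5
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 0
"""
print(parse_dcgm_metrics(sample))
```

In practice the raw text would come from the same `http://<droplet_ip>:9400/metrics` request shown above.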
Repository Setup
Start with the foundational repository rather than writing everything from scratch.
git clone https://github.com/dosraashid/do-adk-gpu-monitor
cd do-adk-gpu-monitor
python3.12 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Configure Secrets
Create a .env file in the project root:
DIGITALOCEAN_API_TOKEN="your_do_token"
GRADIENT_MODEL_ACCESS_KEY="your_gradient_key"
Security note: Never commit `.env` files to version control; the repository's `.gitignore` already excludes this file.
Data Flow Overview
- User Prompt – Sent to the `/run` endpoint.
- LangGraph State – Conversation memory (`thread_id`) is managed by `MemorySaver` for multi-turn interactions.
- Tool Execution – The LLM decides to call `@tool` functions such as `analyze_gpu_fleet()`.
- Parallel Scraping – `analyzer.py` uses `ThreadPoolExecutor` to query the DigitalOcean API and each Droplet's DCGM endpoint concurrently.
- Omniscient Payload – All raw data (temperature, power, VRAM, RAM, CPU, cost) are packaged into a structured JSON dictionary for the LLM.
- Synthesis – The LLM reads the JSON and responds in natural language with node names, costs, and recommendations.
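The parallel-scraping step can be sketched with the standard library; `scrape_node` here is a hypothetical stand-in for the per-Droplet probe in `analyzer.py`:

```python
# Fan out one probe per Droplet using a thread pool, as analyzer.py does.
from concurrent.futures import ThreadPoolExecutor

def scrape_node(droplet):
    # The real probe would hit http://<ip>:9400/metrics; this stub just
    # echoes its input to show the concurrent fan-out/gather shape.
    return {"droplet_id": droplet["id"], "status": "scraped"}

droplets = [{"id": i} for i in range(5)]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(scrape_node, droplets))

print(len(results))  # one payload per Droplet, gathered concurrently
```

Threads suit this workload because each probe is I/O-bound (an HTTP request), so dozens of Droplets can be scanned in roughly the time of the slowest one.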
For more on stateful AI agents with LangGraph, see the Getting Started with Agentic AI Using LangGraph tutorial.
Customizing the Blueprint
Agent Persona (config.py)
Edit AGENT_SYSTEM_PROMPT to change how the AI communicates.
- Remove emojis and request raw bullet points for a technical DevOps assistant.
- Use a management‑focused tone for cost‑summary reports.
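An illustrative persona override in `config.py` might look like the following; the repository's actual default prompt may differ, so treat this as a sketch of the shape, not the shipped value:

```python
# Example replacement for AGENT_SYSTEM_PROMPT: a terse DevOps persona
# with no emojis and a fixed report structure.
AGENT_SYSTEM_PROMPT = (
    "You are a terse DevOps assistant for GPU fleet audits. "
    "Reply in plain bullet points with no emojis. "
    "For each node, always include its name, hourly cost, "
    "and a single recommended action."
)
```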
Thresholds
The default idle detection uses the following dictionary:
THRESHOLDS = {
"gpu": {
"max_temp_c": 82.0,
"max_util_percent": 95.0,
"max_vram_percent": 95.0,
"idle_util_percent": 2.0,
"idle_vram_percent": 5.0,
"optimized_util_percent": 40.0,
"optimized_vram_percent": 50.0,
},
"system": {
"idle_cpu_percent": 3.0,
"idle_ram_percent": 15.0,
"idle_load_15": 0.5,
"starved_cpu_percent": 85.0,
"starved_ram_percent": 90.0,
"optimized_cpu_percent": 40.0,
"optimized_ram_percent": 50.0,
},
}
Adjust values to match your workload baseline (e.g., set `idle_util_percent` to 10.0 if your inference servers typically idle at 8% GPU utilization).
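As a hedged sketch of how these thresholds might drive classification — the `classify_gpu` helper is illustrative, and the repository's actual logic in `analyzer.py` may differ:

```python
# Classify a GPU node from its metrics using the thresholds dictionary.
THRESHOLDS = {
    "gpu": {
        "max_temp_c": 82.0,
        "idle_util_percent": 2.0,
        "idle_vram_percent": 5.0,
    },
}

def classify_gpu(util_pct, vram_pct, temp_c):
    """Return a coarse status label for one GPU node."""
    g = THRESHOLDS["gpu"]
    if temp_c > g["max_temp_c"]:
        return "overheating"
    # A node is idle only when both compute and memory are near zero,
    # which avoids flagging loaded-but-waiting inference servers.
    if util_pct <= g["idle_util_percent"] and vram_pct <= g["idle_vram_percent"]:
        return "idle"
    return "active"

print(classify_gpu(0.5, 1.0, 35.0))   # idle
print(classify_gpu(88.0, 70.0, 76.0)) # active
```

Requiring both low utilization and low VRAM before declaring a node idle is the conservative choice: a model held in VRAM usually means the node is serving, even if the compute engines are momentarily quiet.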
Droplet Filtering
By default only Droplets with "gpu" in the size_slug are scanned:
target_droplets = [d for d in all_droplets if "gpu" in d.get("size_slug", "").lower()]
- Change `"gpu"` to `"c-"` for CPU-optimized Droplets.
- Remove the filter entirely to scan all Droplets.
Adding New Metrics
If you install Prometheus Node Exporter (port 9100) and want to include disk space:
- Update `metrics.py` to scrape disk metrics.
- Extend the return dictionary in `analyzer.py`:
return {
"droplet_id": droplet_id,
"gpu_temp": temp_val,
"gpu_power": power_val,
"vram_used": vram_val,
"disk_space_free_gb": disk_val, # New metric
}
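A sketch of the Node Exporter parsing side of this change: `node_filesystem_avail_bytes` is a standard Node Exporter gauge, but the helper below is a hypothetical illustration, not code from `metrics.py`:

```python
# Extract free disk space (GB) for a mountpoint from Node Exporter
# output (port 9100), in the same Prometheus text format as DCGM.
def disk_free_gb(raw, mountpoint="/"):
    """Return free GB for the given mountpoint, or None if absent."""
    for line in raw.splitlines():
        if (line.startswith("node_filesystem_avail_bytes")
                and f'mountpoint="{mountpoint}"' in line):
            return float(line.rsplit(" ", 1)[1]) / 1e9
    return None

sample = 'node_filesystem_avail_bytes{device="/dev/vda1",mountpoint="/"} 42000000000'
print(disk_free_gb(sample))  # 42.0
```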
Write‑Access Tools (e.g., Power‑Off)
Add a new @tool function in main.py:
@tool
def power_off_droplet(droplet_id: str) -> str:
    """Power off a Droplet by ID. Use only when the user explicitly asks to stop an idle node."""
    import requests, os
    token = os.getenv("DIGITALOCEAN_API_TOKEN")
    response = requests.post(
        f"https://api.digitalocean.com/v2/droplets/{droplet_id}/actions",
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        json={"type": "power_off"},
    )
    if response.status_code == 201:
        return f"Successfully sent power-off command to Droplet {droplet_id}."
    return f"Failed to power off Droplet {droplet_id}: {response.status_code} {response.text}"
Bind the new tool to the LLM:
llm_with_tools = llm.bind_tools([analyze_gpu_fleet, power_off_droplet])
Warning: Granting write access requires strict guardrails—confirmation prompts, tag restrictions, and audit logging are recommended.
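One such guardrail is a tag restriction that the tool checks before acting. This is a minimal sketch; the tag name `agent-managed` and the `guard_power_off` helper are illustrative assumptions, not part of the blueprint:

```python
# Only allow destructive actions on Droplets explicitly opted in via a tag.
ALLOWED_TAG = "agent-managed"

def guard_power_off(droplet):
    """Return True only if the Droplet carries the opt-in tag."""
    return ALLOWED_TAG in droplet.get("tags", [])

print(guard_power_off({"id": 1, "tags": ["agent-managed", "ml"]}))  # True
print(guard_power_off({"id": 2, "tags": ["prod"]}))                 # False
```

In `power_off_droplet` you would fetch the Droplet's tags first and refuse (with a clear message back to the LLM) when the guard fails, and log every attempt for auditing.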
Running Locally
Start the development server:
gradient agent run
In another terminal, simulate requests:
curl -X POST http://localhost:8080/run \
-H "Content-Type: application/json" \
-d '{
"prompt": "Give me a full diagnostic on my GPU nodes including temperature and power.",
"thread_id": "audit-session-1"
}'
Follow‑up example (same thread_id):
curl -X POST http://localhost:8080/run \
-H "Content-Type: application/json" \
-d '{
"prompt": "Which of those nodes was the most expensive?",
"thread_id": "audit-session-1"
}'
Changing the thread_id starts a fresh conversation, demonstrating scoped memory.
Deploying to DigitalOcean Gradient
gradient agent deploy
You’ll receive a public endpoint URL that can be integrated with Slack bots, internal dashboards, CI/CD pipelines, or any HTTP client. The Gradient platform handles scaling for concurrent users.
Comparison: Traditional Dashboards vs. AI Agent
| Factor | Static Dashboards (Grafana + Prometheus) | AI Agent (This Blueprint) |
|---|---|---|
| Setup complexity | Moderate (requires Prometheus server, Grafana, DCGM exporter) | Low (clone repo, set env vars, deploy) |
| Real‑time alerting | Rule‑based alerts with fixed thresholds | Natural‑language queries with adaptive reasoning |
| Multi‑metric correlation | Manual visual comparison | Automatic LLM correlation of temperature, power, VRAM, cost |
| Actionability | Read‑only; separate automation needed | Extensible via @tool for direct API actions |
| Conversational follow‑ups | Not supported | Built‑in via LangGraph MemorySaver and thread_id scoping |
| Best for | Large teams with dedicated SRE/DevOps staff and historical trend analysis | Small‑to‑mid teams needing fast, conversational GPU auditing without full monitoring stack |
For fleets under ~20 GPU Droplets, the AI agent eliminates the overhead of a full monitoring stack while still delivering actionable insights. Larger fleets may benefit from running both solutions.
Architectural Considerations
- Contextual intelligence: `MemorySaver` provides conversation history, enabling drill-down questions without re-scanning the fleet.
- Parallel processing: `ThreadPoolExecutor` scans dozens of Droplets concurrently, preventing LLM timeouts.
- Cost justification: A single idle $500/month GPU instance saved justifies the agent's inference cost.
- Graceful degradation: If port 9400 is unreachable, the agent reports “DCGM Missing” and falls back to CPU/RAM metrics.
- Security: Use read‑only API tokens unless write tools are added; scope permissions carefully and implement audit logging.
Benefits & Use Cases
- Catch forgotten resources: Identify GPU Droplets that remain running after experiments or training jobs finish.
- Noise reduction: Directly query GPU engine and VRAM utilization, bypassing misleading low‑CPU metrics.
- Unified workflow: One natural‑language query replaces multiple UI interactions (DigitalOcean Control Panel, Grafana, architecture diagrams).
- Extensible automation: Add tools to power off, resize, or scale resources directly from the conversational interface.
Next Steps & Resources
- DigitalOcean Gradient AI Platform Documentation – Full reference for deploying and managing AI agents.
- How to Build Agents Using ADK – Step‑by‑step guide for custom agents.
- Getting Started with Agentic AI Using LangGraph – Fundamentals of stateful, multi‑step AI agents.
- Stable Diffusion on DigitalOcean GPU Droplets – Example of GPU‑accelerated AI workloads.
- Scaling Gradient with GPU Droplets and Networking – Architect production GenAI deployments with GPU Droplets, load balancers, and VPC networking.