Tutorial: Build an AI-Powered GPU Fleet Optimizer
Source: Dev.to
Introduction
Deploy a serverless LangGraph agent on the DigitalOcean Gradient AI Platform that monitors your GPU fleet using natural‑language queries. The agent scrapes real‑time NVIDIA DCGM metrics (temperature, power, VRAM, engine utilization) from GPU Droplets via Prometheus‑style endpoints on port 9400, detects idle and under‑utilized GPUs, and can trigger actions such as automated power‑off commands. This reduces cloud costs by replacing reactive dashboard monitoring with a proactive AI assistant.
Why GPU Fleet Management Is Hard
- Cost impact: A single idle GPU Droplet left running overnight can add hundreds of dollars to your monthly bill.
- Traditional dashboards: Show raw metrics but still require a human to interpret whether a machine is “working” or “wasting money.”
The tutorial walks you through building an AI‑powered GPU fleet optimizer using the DigitalOcean Gradient AI Platform and the Agent Development Kit (ADK). By the end you will be able to:
- Deploy a serverless, natural‑language AI agent that audits GPU infrastructure in real time.
- Scrape NVIDIA DCGM metrics (temperature, power draw, VRAM usage, engine utilization).
- Flag idle resources before they inflate your cloud bill.
- Fork and customize the blueprint (adjust thresholds, add tools, change the agent’s persona).
Prerequisites
- DigitalOcean account with at least one active GPU Droplet.
- DigitalOcean API token (Personal Access Token with read permissions and GenAI scopes).
- Gradient Model Access Key (generated from the Gradient AI Dashboard).
- Python 3.12 (recommended for the latest LangGraph and asyncio features).
- Familiarity with Python, REST APIs, and Linux command‑line basics.
NVIDIA DCGM Metrics
NVIDIA Data Center GPU Manager (DCGM) exposes hardware telemetry through a Prometheus‑compatible exporter on port 9400.
| Metric | What It Measures | Why It Matters |
|---|---|---|
| DCGM_FI_DEV_GPU_TEMP | GPU die temperature in Celsius | High temperatures indicate active computation. |
| DCGM_FI_DEV_POWER_USAGE | Current power draw in watts | Idle GPUs draw significantly less power. |
| DCGM_FI_DEV_FB_USED | Framebuffer (VRAM) memory in use | Empty VRAM means no models are loaded. |
| DCGM_FI_DEV_GPU_UTIL | GPU engine utilization percentage | Direct indicator of compute work. |
You can query these metrics directly:
curl -s http://<droplet_ip>:9400/metrics \
| grep -E "DCGM_FI_DEV_GPU_TEMP|DCGM_FI_DEV_POWER_USAGE|DCGM_FI_DEV_FB_USED|DCGM_FI_DEV_GPU_UTIL"
If DCGM is unavailable on a node, the agent falls back to standard CPU/RAM metrics and reports “DCGM Missing”.
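As a minimal sketch of how the exporter output might be reduced to the four gauges in the table above — the helper `parse_dcgm_metrics` is illustrative, not part of the repository:

```python
# Parse Prometheus-style text from DCGM's exporter (port 9400) into a dict.
# An empty result would correspond to the "DCGM Missing" fallback case.
DCGM_METRICS = {
    "DCGM_FI_DEV_GPU_TEMP",
    "DCGM_FI_DEV_POWER_USAGE",
    "DCGM_FI_DEV_FB_USED",
    "DCGM_FI_DEV_GPU_UTIL",
}

def parse_dcgm_metrics(raw):
    """Extract the key DCGM gauges from raw exporter output."""
    values = {}
    for line in raw.splitlines():
        if line.startswith("#"):           # skip HELP/TYPE comment lines
            continue
        parts = line.split()
        if len(parts) != 2:
            continue
        name = parts[0].split("{")[0]      # drop Prometheus label sets
        if name in DCGM_METRICS:
            values[name] = float(parts[1])
    return values

sample = """# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (C)
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 31
DCGM_FI_DEV_POWER_USAGE{gpu="0"} 24.5
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 0
"""
print(parse_dcgm_metrics(sample))
```

In practice the raw text would come from the same `http://<droplet_ip>:9400/metrics` request shown above.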
Repository Setup
Start with the foundational repository rather than writing everything from scratch.
git clone https://github.com/dosraashid/do-adk-gpu-monitor
cd do-adk-gpu-monitor
python3.12 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Configure Secrets
Create a .env file in the project root:
DIGITALOCEAN_API_TOKEN="your_do_token"
GRADIENT_MODEL_ACCESS_KEY="your_gradient_key"
Security note: Never commit `.env` files to version control; the repository's `.gitignore` already excludes this file.
Data Flow Overview
- User Prompt – Sent to the `/run` endpoint.
- LangGraph State – Conversation memory (`thread_id`) is managed by `MemorySaver` for multi-turn interactions.
- Tool Execution – The LLM decides to call `@tool` functions such as `analyze_gpu_fleet()`.
- Parallel Scraping – `analyzer.py` uses `ThreadPoolExecutor` to query the DigitalOcean API and each Droplet's DCGM endpoint concurrently.
- Omniscient Payload – All raw data (temperature, power, VRAM, RAM, CPU, cost) are packaged into a structured JSON dictionary for the LLM.
- Synthesis – The LLM reads the JSON and responds in natural language with node names, costs, and recommendations.
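The parallel-scraping step can be sketched with the standard library; `scrape_node` here is a hypothetical stand-in for the per-Droplet probe in `analyzer.py`:

```python
# Fan out one probe per Droplet using a thread pool, as analyzer.py does.
from concurrent.futures import ThreadPoolExecutor

def scrape_node(droplet):
    # The real probe would hit http://<ip>:9400/metrics; this stub just
    # echoes its input to show the concurrent fan-out/gather shape.
    return {"droplet_id": droplet["id"], "status": "scraped"}

droplets = [{"id": i} for i in range(5)]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(scrape_node, droplets))

print(len(results))  # one payload per Droplet, gathered concurrently
```

Threads suit this workload because each probe is I/O-bound (an HTTP request), so dozens of Droplets can be scanned in roughly the time of the slowest one.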
For more on stateful AI agents with LangGraph, see the Getting Started with Agentic AI Using LangGraph tutorial.
Customizing the Blueprint
Agent Persona (config.py)
Edit AGENT_SYSTEM_PROMPT to change how the AI communicates.
- Remove emojis and request raw bullet points for a technical DevOps assistant.
- Use a management‑focused tone for cost‑summary reports.
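An illustrative persona override in `config.py` might look like the following; the repository's actual default prompt may differ, so treat this as a sketch of the shape, not the shipped value:

```python
# Example replacement for AGENT_SYSTEM_PROMPT: a terse DevOps persona
# with no emojis and a fixed report structure.
AGENT_SYSTEM_PROMPT = (
    "You are a terse DevOps assistant for GPU fleet audits. "
    "Reply in plain bullet points with no emojis. "
    "For each node, always include its name, hourly cost, "
    "and a single recommended action."
)
```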
Thresholds
The default idle detection uses the following dictionary:
THRESHOLDS = {
"gpu": {
"max_temp_c": 82.0,
"max_util_percent": 95.0,
"max_vram_percent": 95.0,
"idle_util_percent": 2.0,
"idle_vram_percent": 5.0,
"optimized_util_percent": 40.0,
"optimized_vram_percent": 50.0,
},
"system": {
"idle_cpu_percent": 3.0,
"idle_ram_percent": 15.0,
"idle_load_15": 0.5,
"starved_cpu_percent": 85.0,
"starved_ram_percent": 90.0,
"optimized_cpu_percent": 40.0,
"optimized_ram_percent": 50.0,
},
}
Adjust values to match your workload baseline (e.g., set `idle_util_percent` to 10.0 if your inference servers typically idle at 8% GPU utilization).
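As a hedged sketch of how these thresholds might drive classification — the `classify_gpu` helper is illustrative, and the repository's actual logic in `analyzer.py` may differ:

```python
# Classify a GPU node from its metrics using the thresholds dictionary.
THRESHOLDS = {
    "gpu": {
        "max_temp_c": 82.0,
        "idle_util_percent": 2.0,
        "idle_vram_percent": 5.0,
    },
}

def classify_gpu(util_pct, vram_pct, temp_c):
    """Return a coarse status label for one GPU node."""
    g = THRESHOLDS["gpu"]
    if temp_c > g["max_temp_c"]:
        return "overheating"
    # A node is idle only when both compute and memory are near zero,
    # which avoids flagging loaded-but-waiting inference servers.
    if util_pct <= g["idle_util_percent"] and vram_pct <= g["idle_vram_percent"]:
        return "idle"
    return "active"

print(classify_gpu(0.5, 1.0, 35.0))   # idle
print(classify_gpu(88.0, 70.0, 76.0)) # active
```

Requiring both low utilization and low VRAM before declaring a node idle is the conservative choice: a model held in VRAM usually means the node is serving, even if the compute engines are momentarily quiet.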
Droplet Filtering
By default only Droplets with "gpu" in the size_slug are scanned:
target_droplets = [d for d in all_droplets if "gpu" in d.get("size_slug", "").lower()]
- Change `"gpu"` to `"c-"` for CPU-optimized Droplets.
- Remove the filter entirely to scan all Droplets.
Adding New Metrics
If you install Prometheus Node Exporter (port 9100) and want to include disk space:
- Update `metrics.py` to scrape disk metrics.
- Extend the return dictionary in `analyzer.py`:
return {
"droplet_id": droplet_id,
"gpu_temp": temp_val,
"gpu_power": power_val,
"vram_used": vram_val,
"disk_space_free_gb": disk_val, # New metric
}
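A sketch of the Node Exporter parsing side of this change: `node_filesystem_avail_bytes` is a standard Node Exporter gauge, but the helper below is a hypothetical illustration, not code from `metrics.py`:

```python
# Extract free disk space (GB) for a mountpoint from Node Exporter
# output (port 9100), in the same Prometheus text format as DCGM.
def disk_free_gb(raw, mountpoint="/"):
    """Return free GB for the given mountpoint, or None if absent."""
    for line in raw.splitlines():
        if (line.startswith("node_filesystem_avail_bytes")
                and f'mountpoint="{mountpoint}"' in line):
            return float(line.rsplit(" ", 1)[1]) / 1e9
    return None

sample = 'node_filesystem_avail_bytes{device="/dev/vda1",mountpoint="/"} 42000000000'
print(disk_free_gb(sample))  # 42.0
```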
Write‑Access Tools (e.g., Power‑Off)
Add a new @tool function in main.py:
@tool
def power_off_droplet(droplet_id: str) -> str:
    """Power off a Droplet by ID. Use only when the user explicitly asks to stop an idle node."""
    import requests, os
    token = os.getenv("DIGITALOCEAN_API_TOKEN")
    response = requests.post(
        f"https://api.digitalocean.com/v2/droplets/{droplet_id}/actions",
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        json={"type": "power_off"},
    )
    if response.status_code == 201:
        return f"Successfully sent power-off command to Droplet {droplet_id}."
    return f"Failed to power off Droplet {droplet_id}: {response.status_code} {response.text}"
Bind the new tool to the LLM:
llm_with_tools = llm.bind_tools([analyze_gpu_fleet, power_off_droplet])
Warning: Granting write access requires strict guardrails—confirmation prompts, tag restrictions, and audit logging are recommended.
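One such guardrail is a tag restriction that the tool checks before acting. This is a minimal sketch; the tag name `agent-managed` and the `guard_power_off` helper are illustrative assumptions, not part of the blueprint:

```python
# Only allow destructive actions on Droplets explicitly opted in via a tag.
ALLOWED_TAG = "agent-managed"

def guard_power_off(droplet):
    """Return True only if the Droplet carries the opt-in tag."""
    return ALLOWED_TAG in droplet.get("tags", [])

print(guard_power_off({"id": 1, "tags": ["agent-managed", "ml"]}))  # True
print(guard_power_off({"id": 2, "tags": ["prod"]}))                 # False
```

In `power_off_droplet` you would fetch the Droplet's tags first and refuse (with a clear message back to the LLM) when the guard fails, and log every attempt for auditing.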
Running Locally
Start the development server:
gradient agent run
In another terminal, simulate requests:
curl -X POST http://localhost:8080/run \
-H "Content-Type: application/json" \
-d '{
"prompt": "Give me a full diagnostic on my GPU nodes including temperature and power.",
"thread_id": "audit-session-1"
}'
Follow‑up example (same thread_id):
curl -X POST http://localhost:8080/run \
-H "Content-Type: application/json" \
-d '{
"prompt": "Which of those nodes was the most expensive?",
"thread_id": "audit-session-1"
}'
Changing the thread_id starts a fresh conversation, demonstrating scoped memory.
Deploying to DigitalOcean Gradient
gradient agent deploy
You’ll receive a public endpoint URL that can be integrated with Slack bots, internal dashboards, CI/CD pipelines, or any HTTP client. The Gradient platform handles scaling for concurrent users.
Comparison: Traditional Dashboards vs. AI Agent
| Factor | Static Dashboards (Grafana + Prometheus) | AI Agent (This Blueprint) |
|---|---|---|
| Setup complexity | Moderate (requires Prometheus server, Grafana, DCGM exporter) | Low (clone repo, set env vars, deploy) |
| Real‑time alerting | Rule‑based alerts with fixed thresholds | Natural‑language queries with adaptive reasoning |
| Multi‑metric correlation | Manual visual comparison | Automatic LLM correlation of temperature, power, VRAM, cost |
| Actionability | Read‑only; separate automation needed | Extensible via @tool for direct API actions |
| Conversational follow‑ups | Not supported | Built‑in via LangGraph MemorySaver and thread_id scoping |
| Best for | Large teams with dedicated SRE/DevOps staff and historical trend analysis | Small‑to‑mid teams needing fast, conversational GPU auditing without full monitoring stack |
For fleets under ~20 GPU Droplets, the AI agent eliminates the overhead of a full monitoring stack while still delivering actionable insights. Larger fleets may benefit from running both solutions.
Architectural Considerations
- Contextual intelligence: `MemorySaver` provides conversation history, enabling drill-down questions without re-scanning the fleet.
- Parallel processing: `ThreadPoolExecutor` scans dozens of Droplets concurrently, preventing LLM timeouts.
- Cost justification: A single idle $500/month GPU instance saved justifies the agent's inference cost.
- Graceful degradation: If port 9400 is unreachable, the agent reports “DCGM Missing” and falls back to CPU/RAM metrics.
- Security: Use read‑only API tokens unless write tools are added; scope permissions carefully and implement audit logging.
Benefits & Use Cases
- Catch forgotten resources: Identify GPU Droplets that remain running after experiments or training jobs finish.
- Noise reduction: Directly query GPU engine and VRAM utilization, bypassing misleading low‑CPU metrics.
- Unified workflow: One natural‑language query replaces multiple UI interactions (DigitalOcean Control Panel, Grafana, architecture diagrams).
- Extensible automation: Add tools to power off, resize, or scale resources directly from the conversational interface.
Next Steps & Resources
- DigitalOcean Gradient AI Platform Documentation – Full reference for deploying and managing AI agents.
- How to Build Agents Using ADK – Step‑by‑step guide for custom agents.
- Getting Started with Agentic AI Using LangGraph – Fundamentals of stateful, multi‑step AI agents.
- Stable Diffusion on DigitalOcean GPU Droplets – Example of GPU‑accelerated AI workloads.
- Scaling Gradient with GPU Droplets and Networking – Architect production GenAI deployments with GPU Droplets, load balancers, and VPC networking.