One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes
Source: DZone DevOps
TL;DR
A single straggling node held up a 4-node distributed training job. We found it by fanning out one SQL query to all four nodes and getting the answer in under a second. This is distributed GPU training debugging with eBPF – no central service, no Prometheus, no time‑series database, just the same single‑binary agent already running on each machine.
The Problem We Kept Hitting
We’ve been building Ingero — an eBPF agent that traces CUDA API calls and host kernel events to explain GPU latency. Until v0.9, it was single‑node only. Trace one machine, explain what happened on that machine. For single‑GPU inference or training, that worked well.