[Paper] CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training
Source: arXiv - 2605.04478v1
Overview
Training modern deep‑learning models now routinely spans thousands of GPUs, and the collective communication libraries (CCLs) that stitch these devices together have become a critical bottleneck. The paper introduces CCL‑D, a diagnostic system that automatically detects, locates, and reports “slow” or “hang” communication anomalies in massive training jobs, a task that traditionally required hours of manual debugging. Deployed on a 4,000‑GPU cluster for a full year, CCL‑D cuts the mean time to diagnosis to under 6 minutes while covering almost 100 % of known incidents.
Key Contributions
- Rank‑level real‑time probing: A lightweight distributed tracing framework that continuously gathers cross‑layer metrics (network latency, kernel execution time, OS‑level counters) without perturbing the training workload.
- Intelligent decision analyzer: A data‑driven module that fuses the probe data, applies statistical anomaly detection, and pinpoints the exact GPU rank responsible for the slowdown.
- Production‑grade validation: A one‑year field study on a 4,000‑GPU cluster, demonstrating near‑complete detection of real‑world slow/hang events and a 10× speed‑up over existing diagnostic pipelines.
- Open‑source‑ready design: The system is built on standard tracing APIs (e.g., NVIDIA Nsight Systems, Intel VTune) and can be integrated with popular deep‑learning frameworks (PyTorch, TensorFlow) with minimal code changes.
Methodology
Distributed Tracing Probe
- Each GPU rank runs a tiny agent that hooks into the collective communication calls (e.g., NCCL AllReduce, AllGather).
- The agent records timestamps, payload sizes, hardware counters (PCIe bandwidth, NIC queue depth), and OS metrics (CPU utilization, context switches).
- Data are streamed to a central collector using a low‑overhead, back‑pressure‑aware protocol, ensuring the training job’s performance impact stays below 1 %.
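The probe idea can be illustrated with a decorator that timestamps each collective call on a rank. This is a minimal sketch, not the paper's agent: `RankProbe`, `CollectiveRecord`, and `fake_all_reduce` are hypothetical names, the stand-in function takes the place of a real NCCL call, and hardware counters, OS metrics, and streaming to a collector are left out.

```python
import functools
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CollectiveRecord:
    """One traced collective call on one rank (field names are illustrative)."""
    rank: int
    op: str
    payload_bytes: int
    start_ns: int
    end_ns: int

    @property
    def duration_ms(self) -> float:
        return (self.end_ns - self.start_ns) / 1e6


@dataclass
class RankProbe:
    """Hypothetical per-rank agent that keeps its records in memory."""
    rank: int
    records: List[CollectiveRecord] = field(default_factory=list)

    def trace(self, op: str) -> Callable:
        """Return a decorator that timestamps a call whose first argument is the payload."""
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(payload, *args, **kwargs):
                start = time.monotonic_ns()
                try:
                    return fn(payload, *args, **kwargs)
                finally:
                    # Record the call even if the collective raised.
                    end = time.monotonic_ns()
                    self.records.append(
                        CollectiveRecord(self.rank, op, len(payload), start, end)
                    )
            return wrapper
        return decorator


probe = RankProbe(rank=3)


@probe.trace("all_reduce")
def fake_all_reduce(payload: bytes) -> bytes:
    # Stand-in for a real collective; a real agent would wrap the library call.
    time.sleep(0.001)
    return payload


fake_all_reduce(b"\x00" * 4096)
print(probe.records[0].op, probe.records[0].payload_bytes)
```

Using a monotonic clock keeps durations on one rank immune to wall-clock adjustments; comparing timestamps across ranks would additionally need clock synchronization, which this sketch does not attempt.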
Feature Extraction & Cross‑Layer Metrics
- Raw traces are transformed into high‑level indicators such as “average per‑step communication latency,” “variance of kernel execution time,” and “network queue occupancy.”
- These metrics are aligned across ranks to expose outliers that deviate from the cluster‑wide norm.
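As an illustration of turning raw traces into aligned per-rank indicators, the sketch below computes each rank's mean and variance of communication latency and its offset from the cluster-wide median. The function name, the dictionary keys, and the sample numbers are assumptions made for this example, not values from the paper.

```python
from statistics import mean, median, pvariance
from typing import Dict, List


def extract_features(latencies_ms: Dict[int, List[float]]) -> Dict[int, Dict[str, float]]:
    """Summarise per-rank latencies and align them against the cluster median."""
    features = {
        rank: {"mean_ms": mean(samples), "var_ms2": pvariance(samples)}
        for rank, samples in latencies_ms.items()
    }
    cluster_median = median(f["mean_ms"] for f in features.values())
    for f in features.values():
        # Positive values mean the rank is slower than the typical rank.
        f["delta_vs_median_ms"] = f["mean_ms"] - cluster_median
    return features


# Made-up per-step latencies for four ranks; rank 3 is deliberately slow.
samples = {
    0: [10.0, 12.0],
    1: [11.0, 11.0],
    2: [10.0, 12.0],
    3: [30.0, 50.0],
}
feats = extract_features(samples)
slowest = max(feats, key=lambda r: feats[r]["delta_vs_median_ms"])
print(slowest, feats[slowest])
```

The median is used as the cluster-wide reference because a single very slow rank shifts it far less than it would shift a mean.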
Intelligent Decision Analyzer
- Anomaly detection: A combination of robust statistical tests (e.g., Median Absolute Deviation) and lightweight machine‑learning models (gradient‑boosted trees) flags suspicious ranks in real time.
- Root‑cause localization: Once an anomaly is detected, the analyzer drills down through the metric hierarchy (hardware → driver → library) to infer the most likely failure point (e.g., NIC congestion, driver deadlock, kernel stall).
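The Median Absolute Deviation test mentioned above can be sketched with the standard modified z-score, where 0.6745 rescales the MAD to be comparable to a standard deviation for normally distributed data and 3.5 is a commonly used cut-off. The function name, the threshold default, and the sample latencies are assumptions for this example; the gradient-boosted model and the root-cause drill-down are not covered.

```python
from statistics import median
from typing import Dict, List


def mad_outliers(values: Dict[int, float], threshold: float = 3.5) -> List[int]:
    """Return the ranks whose modified z-score exceeds `threshold`."""
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        # Degenerate case: over half the ranks share one value, so the score
        # is undefined. This sketch reports nothing rather than divide by zero.
        return []
    return [
        rank
        for rank, v in values.items()
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Made-up mean communication latency per rank, in milliseconds.
latency_ms = {0: 10.1, 1: 9.9, 2: 10.0, 3: 10.2, 4: 25.0}
suspects = mad_outliers(latency_ms)
print(suspects)
```

Because both the centre and the spread are medians, one badly behaving rank cannot inflate the spread estimate and hide itself, which is the usual weakness of mean and standard-deviation thresholds.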
Alerting & Reporting
- The system emits a concise JSON payload containing the affected rank, suspected layer, and confidence score, which can be consumed by existing monitoring stacks (Prometheus, Grafana) or automated remediation scripts.
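A payload of this kind might look like the sketch below. The summary only states that the alert carries the affected rank, suspected layer, and confidence score, so the field names `rank`, `suspected_layer`, and `confidence` and the sample values are assumptions rather than the paper's schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Alert:
    """Hypothetical alert record; field names are illustrative."""
    rank: int
    suspected_layer: str
    confidence: float


alert = Alert(rank=4, suspected_layer="network", confidence=0.97)

# sort_keys gives a stable field order, which keeps alerts easy to diff and test.
payload = json.dumps(asdict(alert), sort_keys=True)
print(payload)
# {"confidence": 0.97, "rank": 4, "suspected_layer": "network"}
```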
Results & Findings
| Metric | Baseline (manual) | CCL‑D |
|---|---|---|
| Mean‑time‑to‑detect (MTTD) | 2–4 hours (often > 24 h) | ≈ 6 minutes |
| Detection accuracy | 70–85 % (missed subtle hangs) | ≈ 99 % (near‑complete coverage) |
| False‑positive rate | 10–15 % (noise from normal variance) | < 2 % |
| Training overhead | N/A (offline) | < 1 % runtime impact |
Key observations:
- Most slow/hang incidents originated from network‑level back‑pressure (e.g., NIC buffer overflow) rather than pure software bugs.
- The rank‑level granularity allowed operators to restart or isolate a single faulty GPU instead of rebooting the whole job, saving up to 30 % of cluster time.
- The lightweight probe scaled linearly up to 8,000 GPUs in synthetic tests, confirming its suitability for future exascale clusters.
Practical Implications
- Faster debugging cycles: Developers can now receive actionable alerts within minutes, dramatically reducing the “black‑box” time that currently stalls large‑scale experiments.
- Automated remediation: The JSON alerts can trigger scripts that automatically migrate the affected rank’s workload, adjust NCCL tuning parameters, or roll back to a known‑good driver version.
- Cost savings: By avoiding full‑job restarts, data‑center operators can reclaim GPU hours that would otherwise be lost to prolonged hangs, potentially saving millions of dollars in large‑scale training runs.
- Improved reliability for SaaS AI services: Companies offering on‑demand model training (e.g., hyper‑parameter tuning services) can embed CCL‑D into their orchestration layer to guarantee SLA compliance.
- Foundation for proactive health monitoring: The same tracing infrastructure can be extended to predict upcoming anomalies (e.g., rising NIC queue depth) before they manifest as hangs, enabling truly predictive maintenance.
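The automated-remediation idea above could be wired up roughly as follows. This is a sketch under stated assumptions: the alert field names, the mapping from suspected layer to action, and the `MIN_CONFIDENCE` cut-off are all hypothetical, and the handlers only return a description of the action instead of performing it.

```python
import json
from typing import Callable, Dict

MIN_CONFIDENCE = 0.9  # assumed cut-off; below it a human is paged instead


def retune_nccl(alert: dict) -> str:
    return f"retune NCCL parameters for rank {alert['rank']}"


def rollback_driver(alert: dict) -> str:
    return f"roll back driver on the node hosting rank {alert['rank']}"


def migrate_rank(alert: dict) -> str:
    return f"migrate workload away from rank {alert['rank']}"


def page_operator(alert: dict) -> str:
    return f"page operator about rank {alert['rank']}"


# Hypothetical mapping from suspected layer to a remediation handler.
ACTIONS: Dict[str, Callable[[dict], str]] = {
    "network": retune_nccl,
    "driver": rollback_driver,
    "hardware": migrate_rank,
}


def dispatch(raw_alert: str) -> str:
    """Pick a remediation for one JSON alert, falling back to a human."""
    alert = json.loads(raw_alert)
    if alert["confidence"] < MIN_CONFIDENCE:
        return page_operator(alert)
    handler = ACTIONS.get(alert["suspected_layer"], page_operator)
    return handler(alert)


print(dispatch('{"rank": 4, "suspected_layer": "network", "confidence": 0.97}'))
```

Routing low-confidence or unrecognised alerts to an operator keeps the automation conservative: a wrong automatic rollback or migration on a large job can cost more than a short delay.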
Limitations & Future Work
- Scope limited to collective communication: CCL‑D focuses on NCCL‑style collectives; point‑to‑point or custom RPC patterns are not yet covered.
- Dependence on vendor tracing APIs: The current implementation relies on NVIDIA and Intel tracing hooks; extending to AMD or emerging accelerator stacks will require additional integration work.
- Model drift in the anomaly detector: The statistical thresholds and ML models are tuned on the authors’ cluster; a different hardware topology may need re‑calibration to avoid false positives.
- Future directions: The authors plan to (1) broaden support to heterogeneous clusters (CPU‑only, TPU), (2) incorporate reinforcement‑learning‑based auto‑tuning of communication parameters, and (3) open‑source the probe/collector as a plug‑in for popular orchestration frameworks like Kubernetes.
Authors
- Yida Gu
- Fakang Wang
- Jianhao Fu
- Zhenhang Sun
- Qianyu Zhang
- Hairui Zhao
- Xingchen Liu
- Yang Tian
- Wenjing Huang
- Zedong Liu
- Yifan Chen
- Jinwu Yang
- Yueyuan Zhou
- Qian Zhao
- Haoxu Li
- Tao Wang
- Feng Yu
- Zhan Wang
- Guangming Tan
- Dingwen Tao
Paper Information
- arXiv ID: 2605.04478v1
- Categories: cs.DC, cs.AI
- Published: May 6, 2026