[Paper] CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training

Published: May 6, 2026 at 12:07 AM EDT

Source: arXiv - 2605.04478v1

Overview

Training modern deep‑learning models now routinely spans thousands of GPUs, and the collective communication libraries (CCLs) that connect these devices have become a critical bottleneck. The paper introduces CCL‑D, a diagnostic system that automatically detects, locates, and reports “slow” and “hang” communication anomalies in massive training jobs, a task that traditionally required hours of manual debugging. Deployed on a 4,000‑GPU cluster for a full year, CCL‑D cut the mean time to diagnosis to under 6 minutes while covering nearly 100 % of known incidents.

Key Contributions

Rank‑level real‑time probing: A lightweight distributed tracing framework that continuously gathers cross‑layer metrics (network latency, kernel execution time, OS‑level counters) while adding less than 1 % runtime overhead to the training workload.
Intelligent decision analyzer: A data‑driven module that fuses the probe data, applies statistical anomaly detection, and pinpoints the exact GPU rank responsible for the slowdown.
Production‑grade validation: A one‑year field study on a 4,000‑GPU cluster demonstrated near‑complete detection of real‑world slow/hang events and a 10× speed‑up over existing diagnostic pipelines.
Open‑source‑ready design: The system builds on standard vendor tracing and profiling tools (e.g., NVIDIA Nsight Systems, Intel VTune) and can be integrated with popular deep‑learning frameworks (PyTorch, TensorFlow) with minimal code changes.

Methodology

Distributed Tracing Probe
- Each GPU rank runs a tiny agent that hooks into the collective communication calls (e.g., NCCL AllReduce, AllGather).
- The agent records timestamps, payload sizes, hardware counters (PCIe bandwidth, NIC queue depth), and OS metrics (CPU utilization, context switches).
- Data are streamed to a central collector using a low‑overhead, back‑pressure‑aware protocol, ensuring the training job’s performance impact stays below 1 %.
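The probe described above can be illustrated with a minimal sketch. This is not the authors' implementation: `RankProbe`, `CollectiveEvent`, and `fake_all_reduce` are hypothetical names, the collective is a plain Python stand-in rather than a real NCCL call, and back-pressure is approximated by a bounded queue that drops events instead of blocking the training loop.

```python
import queue
import time
from dataclasses import dataclass, asdict


@dataclass
class CollectiveEvent:
    rank: int
    op: str              # e.g. "all_reduce"
    payload_bytes: int
    start_ns: int
    end_ns: int


class RankProbe:
    """Hypothetical per-rank agent: times a collective call and buffers the event."""

    def __init__(self, rank, max_pending=1024):
        self.rank = rank
        # Bounded buffer: if the collector falls behind, events are dropped
        # (and counted) rather than stalling the training step.
        self.pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0

    def traced(self, op, payload_bytes, fn, *args, **kwargs):
        """Run fn, recording start/end timestamps even if fn raises."""
        start = time.monotonic_ns()
        try:
            return fn(*args, **kwargs)
        finally:
            event = CollectiveEvent(self.rank, op, payload_bytes,
                                    start, time.monotonic_ns())
            try:
                self.pending.put_nowait(event)
            except queue.Full:
                self.dropped += 1

    def drain(self):
        """Return buffered events as dicts, ready to ship to a collector."""
        events = []
        while True:
            try:
                events.append(asdict(self.pending.get_nowait()))
            except queue.Empty:
                return events


def fake_all_reduce(values):
    # Stand-in for a real collective; it only sums a local list.
    return sum(values)


probe = RankProbe(rank=3, max_pending=2)
results = [probe.traced("all_reduce", 32, fake_all_reduce, [1, 2, 3, 4])
           for _ in range(3)]
events = probe.drain()
```

With `max_pending=2`, the third event is dropped and counted in `probe.dropped`, which shows the design choice: losing a trace sample is preferable to slowing the job being observed.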
Feature Extraction & Cross‑Layer Metrics
- Raw traces are transformed into high‑level indicators such as “average per‑step communication latency,” “variance of kernel execution time,” and “network queue occupancy.”
- These metrics are aligned across ranks to expose outliers that deviate from the cluster‑wide norm.
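As a rough illustration of this step, the sketch below turns raw events into per-rank indicators and aligns them against a cluster-wide norm. The event fields (`rank`, `start_ns`, `end_ns`) and the use of the median as the "norm" are assumptions made for the example; the summary does not specify the exact aggregation.

```python
from statistics import mean, median, pvariance


def per_rank_features(events):
    """Group events by rank; compute mean latency (ms) and its variance."""
    by_rank = {}
    for e in events:
        latency_ms = (e["end_ns"] - e["start_ns"]) / 1e6
        by_rank.setdefault(e["rank"], []).append(latency_ms)
    return {
        rank: {"mean_ms": mean(vals), "var_ms2": pvariance(vals)}
        for rank, vals in by_rank.items()
    }


def deviation_from_norm(features, key="mean_ms"):
    """Subtract the cluster-wide median so outlier ranks stand out."""
    norm = median(f[key] for f in features.values())
    return {rank: f[key] - norm for rank, f in features.items()}


# Toy trace: rank 2 is consistently slower than ranks 0 and 1.
sample_events = [
    {"rank": 0, "start_ns": 0, "end_ns": 2_000_000},
    {"rank": 0, "start_ns": 0, "end_ns": 2_000_000},
    {"rank": 1, "start_ns": 0, "end_ns": 2_000_000},
    {"rank": 1, "start_ns": 0, "end_ns": 4_000_000},
    {"rank": 2, "start_ns": 0, "end_ns": 10_000_000},
    {"rank": 2, "start_ns": 0, "end_ns": 10_000_000},
]
features = per_rank_features(sample_events)
deviations = deviation_from_norm(features)
```

Here the per-rank means are 2 ms, 3 ms, and 10 ms, so rank 2 sits 7 ms above the median rank.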
Intelligent Decision Analyzer
- Anomaly detection: A combination of robust statistical tests (e.g., Median Absolute Deviation) and lightweight machine‑learning models (gradient‑boosted trees) flags suspicious ranks in real time.
- Root‑cause localization: Once an anomaly is detected, the analyzer drills down through the metric hierarchy (hardware → driver → library) to infer the most likely failure point (e.g., NIC congestion, driver deadlock, kernel stall).
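The two analyzer steps can be sketched as follows. This covers only the statistical half: the Median Absolute Deviation test is written as a modified z-score (the 0.6745 scale factor and 3.5 cutoff are a common convention, not values taken from the paper), and the gradient-boosted model is omitted. The per-layer scores passed to `localize` are hypothetical inputs.

```python
from statistics import median

# Drill-down order named in the summary: hardware -> driver -> library.
LAYER_ORDER = ("hardware", "driver", "library")


def mad_scores(value_by_rank):
    """Modified z-score per rank, based on the Median Absolute Deviation."""
    values = list(value_by_rank.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the ranks are identical, so MAD cannot rank
        # deviations; report no anomaly rather than divide by zero.
        return {rank: 0.0 for rank in value_by_rank}
    return {rank: 0.6745 * (v - med) / mad for rank, v in value_by_rank.items()}


def flag_slow_ranks(value_by_rank, threshold=3.5):
    """Return ranks whose latency is anomalously high (one-sided test)."""
    scores = mad_scores(value_by_rank)
    return sorted(r for r, s in scores.items() if s > threshold)


def localize(layer_scores, threshold=3.5):
    """Walk the layers in order; return the first one that looks abnormal."""
    for layer in LAYER_ORDER:
        if layer_scores.get(layer, 0.0) > threshold:
            return layer
    return "unknown"


latency_ms = {0: 2.0, 1: 2.1, 2: 1.9, 3: 2.0, 4: 9.0}
slow = flag_slow_ranks(latency_ms)
layer = localize({"hardware": 0.4, "driver": 5.2, "library": 6.0})
```

In this toy input only rank 4 is flagged, and the drill-down stops at the driver layer because it is the lowest layer with an abnormal score; the library-layer anomaly is treated as a downstream symptom.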
Alerting & Reporting
- The system emits a concise JSON payload containing the affected rank, suspected layer, and confidence score, which can be consumed by existing monitoring stacks (Prometheus, Grafana) or automated remediation scripts.
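A minimal sketch of such an alert and a consumer is shown below. The field names (`rank`, `suspected_layer`, `confidence`), the action names, and the 0.8 confidence cutoff are all assumptions for illustration; the summary only states that the payload carries the affected rank, suspected layer, and a confidence score.

```python
import json


def build_alert(rank, layer, confidence):
    """Serialize an alert; field names are hypothetical."""
    return json.dumps(
        {"rank": rank, "suspected_layer": layer, "confidence": round(confidence, 3)},
        sort_keys=True,
    )


# Hypothetical mapping from suspected layer to a remediation action.
REMEDIATIONS = {
    "hardware": "isolate_rank",
    "driver": "rollback_driver",
    "library": "retune_collectives",
}


def choose_action(alert_json, min_confidence=0.8):
    """Pick an automated action, falling back to a human for weak alerts."""
    alert = json.loads(alert_json)
    if alert["confidence"] < min_confidence:
        return "notify_operator"
    return REMEDIATIONS.get(alert["suspected_layer"], "notify_operator")


alert = build_alert(rank=4, layer="hardware", confidence=0.97)
action = choose_action(alert)
```

Gating automated actions on the confidence score keeps low-confidence alerts in front of an operator instead of triggering a disruptive remediation.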

Results & Findings

Metric	Baseline (manual)	CCL‑D
Mean‑time‑to‑detect (MTTD)	2–4 hours (often > 24 h)	≈ 6 minutes
Detection accuracy	70–85 % (missed subtle hangs)	≈ 99 % (near‑complete coverage)
False‑positive rate	10–15 % (noise from normal variance)	< 2 %
Training overhead	N/A (offline)	< 1 % runtime impact

Key observations:

Most slow/hang incidents originated from network‑level back‑pressure (e.g., NIC buffer overflow) rather than pure software bugs.
The rank‑level granularity allowed operators to restart or isolate a single faulty GPU instead of rebooting the whole job, saving up to 30 % of cluster time.
The lightweight probe scaled linearly up to 8,000 GPUs in synthetic tests, suggesting it is suitable for larger future clusters.

Practical Implications

Faster debugging cycles: Developers can now receive actionable alerts within minutes, dramatically reducing the “black‑box” time that currently stalls large‑scale experiments.
Automated remediation: The JSON alerts can trigger scripts that automatically migrate the affected rank’s workload, adjust NCCL tuning parameters, or roll back to a known‑good driver version.
Cost savings: By avoiding full‑job restarts, data‑center operators can reclaim GPU hours that would otherwise be lost to prolonged hangs—potentially saving millions of dollars in large‑scale training runs.
Improved reliability for SaaS AI services: Companies offering on‑demand model training (e.g., hyper‑parameter tuning services) can embed CCL‑D into their orchestration layer to guarantee SLA compliance.
Foundation for proactive health monitoring: The same tracing infrastructure can be extended to predict upcoming anomalies (e.g., rising NIC queue depth) before they manifest as hangs, enabling truly predictive maintenance.
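The proactive-monitoring idea in the last item can be sketched as a simple trend check: fit a least-squares line to recent NIC queue-depth samples and warn when the slope is steep. This is an assumed approach for illustration, not something the paper is stated to implement; `slope`, `is_trending_up`, and the `min_slope` value are hypothetical.

```python
def slope(samples):
    """Least-squares slope of samples taken at evenly spaced intervals."""
    n = len(samples)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def is_trending_up(samples, min_slope):
    """True when the metric rises by at least min_slope per sample."""
    return slope(samples) >= min_slope


rising = [10, 12, 14, 16]   # queue depth grows by 2 per sample
steady = [5, 5, 5, 5]
warn_rising = is_trending_up(rising, min_slope=1.0)
warn_steady = is_trending_up(steady, min_slope=1.0)
```

A slope test of this kind reacts to sustained growth rather than a single spike, which is the property a pre-hang warning needs.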

Limitations & Future Work

Scope limited to collective communication: CCL‑D focuses on NCCL‑style collectives; point‑to‑point or custom RPC patterns are not yet covered.
Dependence on vendor tracing APIs: The current implementation relies on NVIDIA and Intel tracing hooks; extending to AMD or emerging accelerator stacks will require additional integration work.
Model drift in the anomaly detector: The statistical thresholds and ML models are tuned on the authors’ cluster; clusters with a different hardware topology may need re‑calibration to avoid false positives.
Future directions: The authors plan to (1) broaden support to heterogeneous clusters (CPU‑only, TPU), (2) incorporate reinforcement‑learning‑based auto‑tuning of communication parameters, and (3) open‑source the probe/collector as a plug‑in for popular orchestration frameworks like Kubernetes.
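The re-calibration concern noted above can be made concrete with a small sketch. Assuming an operator has anomaly scores collected from runs known to be healthy on the new cluster, one simple option is to choose the threshold so that only a target fraction of those healthy scores would be flagged. The function name and the 2 % target are illustrative assumptions, not the authors' procedure.

```python
import math


def calibrate_threshold(healthy_scores, target_fpr=0.02):
    """Smallest observed score such that at most target_fpr of the
    healthy scores lie strictly above it."""
    if not healthy_scores:
        raise ValueError("need at least one healthy score")
    ordered = sorted(healthy_scores)
    # Number of healthy samples we tolerate above the threshold.
    allowed = math.floor(target_fpr * len(ordered))
    return ordered[len(ordered) - 1 - allowed]


# Toy baseline: 100 scores from healthy runs on the new cluster.
healthy = [i / 10 for i in range(100)]
threshold = calibrate_threshold(healthy, target_fpr=0.02)
false_positive_rate = sum(s > threshold for s in healthy) / len(healthy)
```

Re-running this on each new topology replaces a threshold inherited from another cluster with one derived from local behavior.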

Authors

Yida Gu
Fakang Wang
Jianhao Fu
Zhenhang Sun
Qianyu Zhang
Hairui Zhao
Xingchen Liu
Yang Tian
Wenjing Huang
Zedong Liu
Yifan Chen
Jinwu Yang
Yueyuan Zhou
Qian Zhao
Haoxu Li
Tao Wang
Feng Yu
Zhan Wang
Guangming Tan
Dingwen Tao

Paper Information

arXiv ID: 2605.04478v1
Categories: cs.DC, cs.AI
Published: May 6, 2026
