[Paper] CCL-Bench 1.0: A Trace-Based Benchmark for LLM Infrastructure

Published: May 7, 2026 at 12:40 PM EDT

Source: arXiv - 2605.06544v1

Overview

The paper introduces CCL‑Bench 1.0, a trace‑based benchmark suite designed to make performance evaluation of large‑language‑model (LLM) infrastructure transparent and reproducible. Each entry captures the full execution trace of a training step, together with a machine‑readable workload description and the launch scripts, so developers can see why a particular hardware‑software combination is fast or slow, not just what the headline number is.

Key Contributions

Trace‑based benchmark format: each data point bundles an execution trace, a YAML “workload card,” and the exact launch scripts used.
Open, community‑extensible toolkit: utilities to parse traces and compute fine‑grained metrics for compute, memory, and communication efficiency.
Empirical insights impossible to extract from summary‑statistic benchmarks, including:
1. Cases where higher compute‑communication overlap actually leads to longer step times, exposing sub‑optimal parallelism.
2. TPU interconnect bandwidth upgrades delivering disproportionately larger speed‑ups than comparable GPU upgrades on small/medium workloads.
3. Performance gaps of up to 3× between the best‑tuned configurations of different training frameworks on identical hardware.
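
To make the three-part bundle concrete, here is a minimal Python sketch of what one benchmark entry might hold. The class and field names (`WorkloadCard`, `BenchmarkEntry`, `global_batch_size`, and so on) are hypothetical and chosen for illustration; they are not the actual CCL‑Bench schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class WorkloadCard:
    """Hypothetical stand-in for the YAML workload card (field names are assumptions)."""
    model_name: str
    global_batch_size: int   # sequences per training step
    sequence_length: int     # tokens per sequence
    framework: str           # e.g. "jax" or "pytorch"
    num_devices: int

    def tokens_per_step(self) -> int:
        """Tokens processed in one training step."""
        return self.global_batch_size * self.sequence_length


@dataclass(frozen=True)
class BenchmarkEntry:
    """One data point: a workload card, an execution trace, and launch scripts."""
    card: WorkloadCard
    trace_path: Optional[str] = None
    launch_scripts: Tuple[str, ...] = ()

    def missing_parts(self) -> List[str]:
        """Names of bundle components that are absent from this entry."""
        missing = []
        if self.trace_path is None:
            missing.append("trace")
        if not self.launch_scripts:
            missing.append("launch_scripts")
        return missing
```

A contribution check could reject any entry for which `missing_parts()` is non-empty, since an entry without its trace or launch scripts cannot be reproduced.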

Methodology

Workload Selection – The authors chose a representative set of LLM training workloads (varying model size, batch size, and token length) that are common in industry and research.
Trace Collection – For each run, they record a low‑overhead trace that logs kernel launches, memory allocations, and inter‑device communication events.
Workload Card (YAML) – A declarative description captures model architecture, hyper‑parameters, hardware topology, and software stack (framework version, compiler flags, communication library).
Toolkit Processing – The open‑source CCL‑Bench toolkit ingests the trace and card, then computes per‑step metrics such as:
- Compute utilization (% of FLOPs delivered vs. theoretical peak)
- Memory bandwidth usage and contention
- Communication volume, latency, and overlap with compute
Comparative Experiments – They systematically vary one dimension at a time (e.g., interconnect bandwidth, framework, parallelism strategy) while keeping everything else constant, enabling causal attribution of performance differences.
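
The utilization and overlap metrics listed above can be sketched as follows. This is a minimal illustration, not the CCL‑Bench toolkit's API: the interval representation of trace events and the function names (`merge`, `overlap_fraction`, `compute_utilization`) are assumptions made for this example.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_s, end_s) of one traced event


def merge(intervals: List[Interval]) -> List[Interval]:
    """Union of possibly overlapping intervals, sorted by start time."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersection_length(a: List[Interval], b: List[Interval]) -> float:
    """Total time during which both merged interval lists are active."""
    i = j = 0
    overlap = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            overlap += hi - lo
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return overlap


def overlap_fraction(compute: List[Interval], comm: List[Interval]) -> float:
    """Share of communication time that runs concurrently with compute."""
    comm_merged = merge(comm)
    comm_time = sum(end - start for start, end in comm_merged)
    if comm_time == 0:
        return 0.0
    return intersection_length(merge(compute), comm_merged) / comm_time


def compute_utilization(flops_per_step: float, step_time_s: float,
                        peak_flops_per_s: float) -> float:
    """Delivered FLOPs as a fraction of the theoretical peak for the step."""
    return flops_per_step / (step_time_s * peak_flops_per_s)
```

For example, with compute events `[(0.0, 4.0), (6.0, 8.0)]` and a single communication event `[(3.0, 7.0)]`, two of the four communication seconds coincide with compute, so `overlap_fraction` returns 0.5.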

Results & Findings

Scenario	Observation	Interpretation
Higher compute‑communication overlap	Overlap ↑ but step time ↑ as well	Indicates that the overlap is achieved by stalling compute (e.g., smaller micro‑batches) rather than truly parallel execution.
Doubling interconnect bandwidth	TPU: ~30 % step‑time reduction; GPU: ~8 % reduction (small/medium models)	TPU’s mesh‑based interconnect is more latency‑sensitive for these workloads; GPUs are bottlenecked elsewhere (e.g., memory).
Framework tuning	Same hardware, same model → PyTorch best config up to 3× slower than JAX best config	Different default parallelism heuristics and kernel libraries can dominate performance; “out‑of‑the‑box” tuning is insufficient for production.
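
One way to read the bandwidth row is through a back-of-the-envelope model. The sketch below assumes, purely for illustration and not as the paper's analysis, that a fraction of the step time is exposed communication whose duration scales inversely with interconnect bandwidth, while the rest of the step is unaffected. The function names are hypothetical.

```python
def step_time_reduction(bw_bound_fraction: float, bw_scale: float) -> float:
    """Relative step-time reduction when bandwidth is multiplied by bw_scale.

    Assumed model: step = other + bw_bound, and bw_bound shrinks by 1/bw_scale.
    """
    if not 0.0 <= bw_bound_fraction <= 1.0:
        raise ValueError("bw_bound_fraction must be in [0, 1]")
    if bw_scale <= 0:
        raise ValueError("bw_scale must be positive")
    return bw_bound_fraction * (1.0 - 1.0 / bw_scale)


def implied_bw_bound_fraction(reduction: float, bw_scale: float) -> float:
    """Invert the model: bandwidth-bound share implied by an observed reduction."""
    if bw_scale <= 1:
        raise ValueError("bw_scale must be greater than 1")
    return reduction / (1.0 - 1.0 / bw_scale)
```

Under this simplified model, a roughly 30 % reduction from doubling bandwidth would correspond to about 60 % of the step being bandwidth-bound, and a roughly 8 % reduction to about 16 %. Real workloads also depend on latency and overlap, so these figures are only an intuition aid.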

These findings demonstrate that a single “seconds per step” number hides a lot of nuance. The trace‑based approach lets engineers pinpoint whether they need more compute, better memory layout, or a smarter communication schedule.
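
A toy calculation shows how overlap and step time can move in the same direction. The numbers below are made up for illustration and are not taken from the paper; the model assumes step time equals compute time plus communication time minus the time during which the two run concurrently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepProfile:
    """Illustrative per-step timing summary (all values hypothetical)."""
    compute_ms: float
    comm_ms: float
    overlapped_ms: float  # time when compute and communication run concurrently

    @property
    def overlap_fraction(self) -> float:
        """Share of communication time hidden behind compute."""
        return self.overlapped_ms / self.comm_ms

    @property
    def step_time_ms(self) -> float:
        """Wall-clock step time under the simple additive model."""
        return self.compute_ms + self.comm_ms - self.overlapped_ms


# Config A: efficient compute, little overlap.
config_a = StepProfile(compute_ms=100.0, comm_ms=20.0, overlapped_ms=10.0)
# Config B: smaller micro-batches make compute slower and add communication,
# but most of that communication is hidden.
config_b = StepProfile(compute_ms=120.0, comm_ms=30.0, overlapped_ms=27.0)

for name, cfg in (("A", config_a), ("B", config_b)):
    print(f"{name}: overlap={cfg.overlap_fraction:.0%}, step={cfg.step_time_ms:.0f} ms")
```

Config B hides 90 % of its communication versus 50 % for config A, yet its step takes 123 ms instead of 110 ms, because the extra compute and communication cost more than the overlap recovers.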

Practical Implications

Informed hardware purchases – Companies can estimate the impact of a higher‑bandwidth TPU mesh vs. a GPU NVLink upgrade before committing capital.
Framework selection & tuning – The benchmark makes it clear that the “best” framework depends on the workload; teams can allocate engineering effort to the framework that yields the biggest ROI.
Automated performance regression testing – Because each benchmark entry is reproducible (trace + launch script), CI pipelines can detect when a software update (e.g., a new CUDA version) degrades a specific efficiency metric.
Better scheduling & parallelism strategies – By exposing when compute‑communication overlap is counter‑productive, developers can redesign data‑parallel or pipeline‑parallel schemes to truly hide latency.
Community‑driven benchmarking – The open format encourages contributions from other labs, leading to a richer, more diverse performance dataset that reflects real‑world production workloads.
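
The regression-testing idea above could look roughly like this in a CI job. This is a sketch under stated assumptions: the metric names are hypothetical, every metric is treated as higher-is-better, and the 5 % tolerance is an arbitrary default rather than a value from the paper.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, float],
                     current: Dict[str, float],
                     tolerance: float = 0.05) -> List[str]:
    """Return metrics that dropped by more than `tolerance` relative to baseline.

    Assumes every metric is higher-is-better. A metric missing from `current`
    is also reported, since it can no longer be verified.
    """
    regressions = []
    for name, base in baseline.items():
        if name not in current:
            regressions.append(name)
            continue
        if base <= 0:
            continue  # a relative drop is undefined for a non-positive baseline
        drop = (base - current[name]) / base
        if drop > tolerance:
            regressions.append(name)
    return sorted(regressions)


baseline = {"compute_utilization": 0.50, "tokens_per_second": 1000.0}
after_update = {"compute_utilization": 0.49, "tokens_per_second": 900.0}
print(find_regressions(baseline, after_update))
```

Here the 2 % dip in compute utilization stays within tolerance, while the 10 % throughput drop is flagged. A pipeline could fail the build whenever the returned list is non-empty.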

Limitations & Future Work

Scope of workloads – The current suite focuses on a handful of LLM training configurations; inference workloads, multimodal models, and extremely large clusters are not yet covered.
Trace overhead – Although lightweight, the tracing infrastructure adds a small runtime penalty that could affect ultra‑tight latency measurements.
Hardware diversity – Experiments are limited to a few TPU and GPU generations; other accelerators (e.g., Habana, Graphcore) will need dedicated adapters.
Automation of metric interpretation – Future versions could integrate ML‑based anomaly detection to automatically flag inefficient overlap patterns or sub‑optimal communication schedules.

CCL‑Bench 1.0 is a step toward a more data‑driven approach to LLM infrastructure evaluation, giving developers the tooling to turn raw performance numbers into actionable engineering decisions.

Authors

Eric Ding
Byungsoo Oh
Bhaskar Kataria
Kaiwen Guo
Jelena Gvero
Abhishek Vijaya Kumar
Arjun Devraj
Lindsey Bowen
Atharv Sonwane
Emaad Manzoor
Rachee Singh

Paper Information

arXiv ID: 2605.06544v1
Categories: cs.DC, cs.NI
Published: May 7, 2026
