[Paper] Understanding LLM Checkpoint/Restore I/O Strategies and Patterns
Source: arXiv - 2512.24511v1
Overview
The paper Understanding LLM Checkpoint/Restore I/O Strategies and Patterns examines why saving and loading large language models (LLMs) has become a full-blown I/O challenge. By treating checkpoint/restore as a “big-data” problem, the authors show how modern kernel-level I/O interfaces (e.g., io_uring via liburing) can dramatically speed up the write-and-read cycles that dominate large-scale training and inference pipelines.
Key Contributions
- Micro‑benchmark suite for measuring checkpoint I/O on GPU‑to‑storage paths, exposing the impact of aggregation, alignment, and coalescing.
- Empirical evidence that uncoalesced small‑buffer writes cut throughput in half compared with synthetic, well‑aligned workloads.
- File‑system‑aware aggregation strategy that restores near‑optimal bandwidth and slashes metadata overhead.
- Performance comparison against two production‑grade checkpoint engines (DataStates‑LLM, TorchSnapshot), achieving up to 3.9× and 7.6× higher write throughput respectively.
- Guidelines for matching I/O patterns to modern parallel file systems (e.g., Lustre, GPFS) and kernel‑accelerated APIs.
Methodology
- Scenario modeling – The authors reproduced a realistic 3‑D parallel training setup (tensor, pipeline, and data parallelism) that spawns dozens of processes, each dumping hundreds of tensors of heterogeneous size.
- I/O stack profiling – They instrumented the full path from GPU memory → host pinned memory → local SSD cache → remote parallel file system, measuring latency and bandwidth at each hop.
- Kernel-level I/O experiments – Using liburing (the userspace library for Linux io_uring), they implemented three variants (a minimal packing sketch follows this list):
  - Buffered I/O (standard POSIX-style writes)
  - Direct I/O (bypassing the page cache)
  - Aggregated/coalesced I/O (grouping many small tensors into larger, aligned buffers before submission)
- Benchmark matrix – Each variant was run under different buffer sizes, alignment constraints, and concurrency levels, both in isolation and mixed with background workloads.
- Head‑to‑head comparison – The same checkpoint workload was executed with DataStates‑LLM and TorchSnapshot to establish a real‑world baseline.
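To make the aggregated/coalesced variant concrete, the sketch below packs many small tensor payloads into large, alignment-padded chunks before writing, and records a name → (offset, length) manifest so a restore step could slice each tensor back out. This is a minimal stand-in written against the Python standard library, not the authors' implementation: the paper submits writes through liburing/io_uring, whereas this sketch uses plain os.pwrite, and the names and constants (pack_and_write, ALIGN, CHUNK_BYTES) are illustrative assumptions.

```python
"""Minimal sketch of the aggregated/coalesced checkpoint write pattern.

Not the paper's implementation: the authors submit writes via liburing
(io_uring); plain os.pwrite stands in here so the packing/alignment logic
stays visible. All names and constants below are illustrative.
"""
import os
from typing import Dict, Iterable, Tuple

ALIGN = 4 * 1024 * 1024          # 4 MiB alignment, in line with the paper's takeaways
CHUNK_BYTES = 64 * 1024 * 1024   # target size of each coalesced write (assumption)


def _round_up(n: int, align: int) -> int:
    return (n + align - 1) // align * align


def pack_and_write(path: str,
                   tensors: Iterable[Tuple[str, bytes]]) -> Dict[str, Tuple[int, int]]:
    """Coalesce many small tensor payloads into large, aligned writes.

    Returns a manifest mapping tensor name -> (file offset, byte length),
    which a restore step would use to locate each tensor in the blob.
    """
    manifest: Dict[str, Tuple[int, int]] = {}
    chunk = bytearray()
    file_offset = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        def flush() -> None:
            nonlocal chunk, file_offset
            if not chunk:
                return
            # Pad so every write starts and ends on an aligned boundary.
            padded = _round_up(len(chunk), ALIGN)
            chunk.extend(b"\0" * (padded - len(chunk)))
            os.pwrite(fd, chunk, file_offset)
            file_offset += padded
            chunk = bytearray()

        for name, payload in tensors:
            # Record where this tensor will land before appending it.
            manifest[name] = (file_offset + len(chunk), len(payload))
            chunk.extend(payload)
            if len(chunk) >= CHUNK_BYTES:
                flush()
        flush()
        os.fsync(fd)
    finally:
        os.close(fd)
    return manifest
```

A naive per-tensor variant would instead issue one write per payload; the paper attributes most of the measured throughput gap between the two approaches to exactly that difference in request size and count.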
Results & Findings
| Metric | POSIX Buffered | liburing Buffered | liburing Direct (no agg.) | liburing Aggregated | DataStates‑LLM | TorchSnapshot |
|---|---|---|---|---|---|---|
| Peak write throughput (GB/s) | 1.2 | 1.8 | 1.0 | 4.7 | 1.2 | 0.6 |
| Metadata ops per checkpoint | 1.8 M | 1.5 M | 2.0 M | 0.4 M | 1.6 M | 1.4 M |
| Avg latency per tensor (µs) | 45 | 30 | 55 | 12 | 38 | 62 |
| Scaling (processes → 64) | 0.6× ideal | 0.8× ideal | 0.5× ideal | 0.95× ideal | 0.6× ideal | 0.4× ideal |
Key takeaways
- Aggregation matters: grouping small tensors into 1–4 MiB aligned buffers recovers > 90 % of the theoretical bandwidth of the underlying file system.
- Direct I/O alone isn’t enough: without aggregation, bypassing the page cache actually hurts performance because the storage backend sees many tiny, misaligned requests.
- io_uring’s low‑overhead submission reduces CPU time spent in the kernel, freeing cores for compute‑heavy training loops.
- The authors’ prototype outperforms the two state‑of‑the‑art checkpoint engines by a large margin, especially when the checkpoint size exceeds several hundred gigabytes.
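The second takeaway reflects the constraints direct I/O places on every request: with O_DIRECT, the buffer address, transfer length, and file offset generally must be block-aligned, and each request pays a full trip to the storage backend. The Linux-only sketch below is a stand-in using the Python standard library rather than the paper's liburing harness; the block size, constants, and function name are assumptions.

```python
"""Sketch of why direct I/O needs aligned, sizeable requests (Linux-only).

O_DIRECT bypasses the page cache, but the kernel then expects the buffer
address, the transfer length, and the file offset to be block-aligned.
An anonymous mmap provides a page-aligned buffer; padding the length to a
block multiple keeps the request well-formed. Illustrative stand-in only,
not the paper's benchmark code.
"""
import mmap
import os

BLOCK = 4096  # assume a 4 KiB logical block size


def aligned_direct_write(path: str, payload: bytes) -> int:
    # Pad to at least one full block so length stays a block multiple.
    padded_len = max(BLOCK, (len(payload) + BLOCK - 1) // BLOCK * BLOCK)
    buf = mmap.mmap(-1, padded_len)     # page-aligned anonymous buffer, zero-filled
    buf[:len(payload)] = payload
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DIRECT, 0o644)
    try:
        written = os.pwrite(fd, buf, 0)  # offset 0 is trivially aligned
    finally:
        os.close(fd)
        buf.close()
    return written

# A misaligned request (odd length or offset, unaligned buffer) would
# typically be rejected with EINVAL, and even correctly aligned but tiny
# requests each pay a full device round trip, which is why per-tensor
# direct writes underperform without aggregation.
```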
Practical Implications
- Training pipelines can shave minutes (or even hours) off each checkpoint cycle, directly reducing total time‑to‑solution for models that require dozens of saves per epoch.
- Infrastructure cost: Higher I/O efficiency means less pressure on parallel file systems, potentially allowing more jobs per cluster or smaller storage budgets.
- Developer APIs – The paper's aggregation pattern can be wrapped in a thin library (e.g., a PyTorch torch.utils.checkpoint extension) that automatically buffers tensors before issuing io_uring writes, requiring minimal code changes.
- Hybrid cloud / on-prem setups: By aligning buffers to the block size of remote object stores (e.g., S3-compatible APIs), the same technique can improve checkpoint uploads to cloud-based artifact repositories.
- Debugging & observability: The micro‑benchmark suite can be repurposed as a health‑check tool to verify that a new storage tier (NVMe, burst‑buffer, etc.) is delivering expected throughput before large‑scale runs.
Limitations & Future Work
- GPU-to-CPU transfer cost is still a dominant factor; the study assumes pinned host memory but does not explore emerging GPUDirect Storage (GDS) pathways.
- Experiments were limited to a single HPC cluster configuration; results may differ on heterogeneous cloud storage stacks or on systems with different network topologies.
- The aggregation logic is currently static (fixed buffer size); adaptive schemes that react to runtime tensor size distributions could yield further gains.
- The authors note that fault‑tolerance (e.g., partial‑write recovery) and security (encryption at rest) were outside the scope and merit dedicated investigation.
Bottom line: By treating LLM checkpointing as a high-performance I/O problem and leveraging kernel-level APIs with smart aggregation, developers can achieve severalfold speedups (up to 7.6× over existing checkpoint engines in the reported experiments), making massive model training more practical and cost-effective.
Authors
- Mikaila J. Gossman
- Avinash Maurya
- Bogdan Nicolae
- Jon C. Calhoun
Paper Information
- arXiv ID: 2512.24511v1
- Categories: cs.DC
- Published: December 30, 2025