[Paper] Understanding LLM Checkpoint/Restore I/O Strategies and Patterns
Source: arXiv - 2512.24511v1
Overview
The paper Understanding LLM Checkpoint/Restore I/O Strategies and Patterns examines why saving and loading large language models (LLMs) has become a full-blown I/O challenge. By treating checkpoint/restore as a “big-data” problem, the authors show how modern kernel-level I/O interfaces (e.g., io_uring via liburing) can dramatically speed up the write-and-read cycles that dominate large-scale training and inference pipelines.
Key Contributions
- Micro‑benchmark suite for measuring checkpoint I/O on GPU‑to‑storage paths, exposing the impact of aggregation, alignment, and coalescing.
- Empirical evidence that uncoalesced small‑buffer writes cut throughput in half compared with synthetic, well‑aligned workloads.
- File‑system‑aware aggregation strategy that restores near‑optimal bandwidth and slashes metadata overhead.
- Performance comparison against two production‑grade checkpoint engines (DataStates‑LLM, TorchSnapshot), achieving up to 3.9× and 7.6× higher write throughput respectively.
- Guidelines for matching I/O patterns to modern parallel file systems (e.g., Lustre, GPFS) and kernel‑accelerated APIs.
Methodology
- Scenario modeling – The authors reproduced a realistic 3‑D parallel training setup (tensor, pipeline, and data parallelism) that spawns dozens of processes, each dumping hundreds of tensors of heterogeneous size.
- I/O stack profiling – They instrumented the full path from GPU memory → host pinned memory → local SSD cache → remote parallel file system, measuring latency and bandwidth at each hop.
- Kernel-level I/O experiments – Using liburing (the userspace library for Linux io_uring), they implemented three variants (a minimal packing sketch follows this list):
  - Buffered I/O (standard POSIX-style writes)
  - Direct I/O (bypassing the page cache)
  - Aggregated/coalesced I/O (grouping many small tensors into larger, aligned buffers before submission)
- Benchmark matrix – Each variant was run under different buffer sizes, alignment constraints, and concurrency levels, both in isolation and mixed with background workloads.
- Head‑to‑head comparison – The same checkpoint workload was executed with DataStates‑LLM and TorchSnapshot to establish a real‑world baseline.
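To make the aggregated/coalesced variant concrete, the sketch below packs many small tensor payloads into large, alignment-padded chunks before writing, and records a name → (offset, length) manifest so a restore step could slice each tensor back out. This is a minimal stand-in written against the Python standard library, not the authors' implementation: the paper submits writes through liburing/io_uring, whereas this sketch uses plain os.pwrite, and the names and constants (pack_and_write, ALIGN, CHUNK_BYTES) are illustrative assumptions.

```python
"""Minimal sketch of the aggregated/coalesced checkpoint write pattern.

Not the paper's implementation: the authors submit writes via liburing
(io_uring); plain os.pwrite stands in here so the packing/alignment logic
stays visible. All names and constants below are illustrative.
"""
import os
from typing import Dict, Iterable, Tuple

ALIGN = 4 * 1024 * 1024          # 4 MiB alignment, in line with the paper's takeaways
CHUNK_BYTES = 64 * 1024 * 1024   # target size of each coalesced write (assumption)


def _round_up(n: int, align: int) -> int:
    return (n + align - 1) // align * align


def pack_and_write(path: str,
                   tensors: Iterable[Tuple[str, bytes]]) -> Dict[str, Tuple[int, int]]:
    """Coalesce many small tensor payloads into large, aligned writes.

    Returns a manifest mapping tensor name -> (file offset, byte length),
    which a restore step would use to locate each tensor in the blob.
    """
    manifest: Dict[str, Tuple[int, int]] = {}
    chunk = bytearray()
    file_offset = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        def flush() -> None:
            nonlocal chunk, file_offset
            if not chunk:
                return
            # Pad so every write starts and ends on an aligned boundary.
            padded = _round_up(len(chunk), ALIGN)
            chunk.extend(b"\0" * (padded - len(chunk)))
            os.pwrite(fd, chunk, file_offset)
            file_offset += padded
            chunk = bytearray()

        for name, payload in tensors:
            # Record where this tensor will land before appending it.
            manifest[name] = (file_offset + len(chunk), len(payload))
            chunk.extend(payload)
            if len(chunk) >= CHUNK_BYTES:
                flush()
        flush()
        os.fsync(fd)
    finally:
        os.close(fd)
    return manifest
```

A naive per-tensor variant would instead issue one write per payload; the paper attributes most of the measured throughput gap between the two approaches to exactly that difference in request size and count.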
Results & Findings
| Metric | POSIX Buffered | liburing Buffered | liburing Direct (no agg.) | liburing Aggregated | DataStates‑LLM | TorchSnapshot |
|---|---|---|---|---|---|---|
| Peak write throughput (GB/s) | 1.2 | 1.8 | 1.0 | 4.7 | 1.2 | 0.6 |
| Metadata ops per checkpoint | 1.8 M | 1.5 M | 2.0 M | 0.4 M | 1.6 M | 1.4 M |
| Avg latency per tensor (µs) | 45 | 30 | 55 | 12 | 38 | 62 |
| Scaling (processes → 64) | 0.6× ideal | 0.8× ideal | 0.5× ideal | 0.95× ideal | 0.6× ideal | 0.4× ideal |
Key takeaways
- Aggregation matters: grouping small tensors into 1–4 MiB aligned buffers recovers > 90 % of the theoretical bandwidth of the underlying file system.
- Direct I/O alone isn’t enough: without aggregation, bypassing the page cache actually hurts performance because the storage backend sees many tiny, misaligned requests.
- io_uring’s low‑overhead submission reduces CPU time spent in the kernel, freeing cores for compute‑heavy training loops.
- The authors’ prototype outperforms the two state‑of‑the‑art checkpoint engines by a large margin, especially when the checkpoint size exceeds several hundred gigabytes.
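The second takeaway reflects the constraints direct I/O places on every request: with O_DIRECT, the buffer address, transfer length, and file offset generally must be block-aligned, and each request pays a full trip to the storage backend. The Linux-only sketch below is a stand-in using the Python standard library rather than the paper's liburing harness; the block size, constants, and function name are assumptions.

```python
"""Sketch of why direct I/O needs aligned, sizeable requests (Linux-only).

O_DIRECT bypasses the page cache, but the kernel then expects the buffer
address, the transfer length, and the file offset to be block-aligned.
An anonymous mmap provides a page-aligned buffer; padding the length to a
block multiple keeps the request well-formed. Illustrative stand-in only,
not the paper's benchmark code.
"""
import mmap
import os

BLOCK = 4096  # assume a 4 KiB logical block size


def aligned_direct_write(path: str, payload: bytes) -> int:
    # Pad to at least one full block so length stays a block multiple.
    padded_len = max(BLOCK, (len(payload) + BLOCK - 1) // BLOCK * BLOCK)
    buf = mmap.mmap(-1, padded_len)     # page-aligned anonymous buffer, zero-filled
    buf[:len(payload)] = payload
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DIRECT, 0o644)
    try:
        written = os.pwrite(fd, buf, 0)  # offset 0 is trivially aligned
    finally:
        os.close(fd)
        buf.close()
    return written

# A misaligned request (odd length or offset, unaligned buffer) would
# typically be rejected with EINVAL, and even correctly aligned but tiny
# requests each pay a full device round trip, which is why per-tensor
# direct writes underperform without aggregation.
```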
Practical Implications
- Training pipelines can shave minutes (or even hours) off each checkpoint cycle, directly reducing total time‑to‑solution for models that require dozens of saves per epoch.
- Infrastructure cost: Higher I/O efficiency means less pressure on parallel file systems, potentially allowing more jobs per cluster or smaller storage budgets.
- Developer APIs – The paper's aggregation pattern can be wrapped in a thin library (e.g., a PyTorch torch.utils.checkpoint extension) that automatically buffers tensors before issuing io_uring writes, requiring minimal code changes.
- Hybrid cloud / on-prem setups: By aligning buffers to the block size of remote object stores (e.g., S3-compatible APIs), the same technique can improve checkpoint uploads to cloud-based artifact repositories.
- Debugging & observability: The micro‑benchmark suite can be repurposed as a health‑check tool to verify that a new storage tier (NVMe, burst‑buffer, etc.) is delivering expected throughput before large‑scale runs.
Limitations & Future Work
- GPU-to-CPU transfer cost is still a dominant factor; the study assumes pinned host memory but does not explore emerging GPUDirect Storage (GDS) pathways.
- Experiments were limited to a single HPC cluster configuration; results may differ on heterogeneous cloud storage stacks or on systems with different network topologies.
- The aggregation logic is currently static (fixed buffer size); adaptive schemes that react to runtime tensor size distributions could yield further gains.
- The authors note that fault‑tolerance (e.g., partial‑write recovery) and security (encryption at rest) were outside the scope and merit dedicated investigation.
Bottom line: By treating LLM checkpointing as a high-performance I/O problem and leveraging kernel-level APIs with smart aggregation, developers can achieve severalfold speedups (up to 7.6× over existing checkpoint engines in the reported experiments), making massive model training more practical and cost-effective.
Authors
- Mikaila J. Gossman
- Avinash Maurya
- Bogdan Nicolae
- Jon C. Calhoun
Paper Information
- arXiv ID: 2512.24511v1
- Categories: cs.DC
- Published: December 30, 2025