[Paper] Understanding LLM Checkpoint/Restore I/O Strategies and Patterns

Published: December 30, 2025 at 06:21 PM EST
4 min read

Source: arXiv - 2512.24511v1

Overview

The paper Understanding LLM Checkpoint/Restore I/O Strategies and Patterns examines why saving and loading large language models (LLMs) has become a full‑blown I/O challenge. By treating checkpoint/restore as a “big‑data” problem, the authors show how modern kernel‑level I/O (e.g., liburing) can dramatically speed up the write‑and‑read cycles that dominate large‑scale training and inference pipelines.

Key Contributions

  • Micro‑benchmark suite for measuring checkpoint I/O on GPU‑to‑storage paths, exposing the impact of aggregation, alignment, and coalescing.
  • Empirical evidence that uncoalesced small‑buffer writes cut throughput in half compared with synthetic, well‑aligned workloads.
  • File‑system‑aware aggregation strategy that restores near‑optimal bandwidth and slashes metadata overhead.
  • Performance comparison against two production‑grade checkpoint engines (DataStates‑LLM, TorchSnapshot), achieving up to 3.9× and 7.6× higher write throughput respectively.
  • Guidelines for matching I/O patterns to modern parallel file systems (e.g., Lustre, GPFS) and kernel‑accelerated APIs.

Methodology

  1. Scenario modeling – The authors reproduced a realistic 3‑D parallel training setup (tensor, pipeline, and data parallelism) that spawns dozens of processes, each dumping hundreds of tensors of heterogeneous size.
  2. I/O stack profiling – They instrumented the full path from GPU memory → host pinned memory → local SSD cache → remote parallel file system, measuring latency and bandwidth at each hop.
  3. Kernel‑level I/O experiments – Using liburing (Linux io_uring), they implemented three variants (a minimal submission sketch follows this list):
    • Buffered I/O (standard POSIX‑style writes)
    • Direct I/O (bypassing page cache)
    • Aggregated/coalesced I/O (grouping many small tensors into larger, aligned buffers before submission).
  4. Benchmark matrix – Each variant was run under different buffer sizes, alignment constraints, and concurrency levels, both in isolation and mixed with background workloads.
  5. Head‑to‑head comparison – The same checkpoint workload was executed with DataStates‑LLM and TorchSnapshot to establish a real‑world baseline.
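
To make these variants concrete, the C sketch below submits one checkpoint buffer through liburing. It illustrates the submission mechanics only and is not the authors' benchmark code; the file name, 4 MiB buffer size, queue depth, and the commented‑out O_DIRECT flag are placeholder assumptions.

```c
/* Minimal liburing write sketch (illustrative only, not the paper's code).
 * Compile with: gcc -o ckpt_write ckpt_write.c -luring
 * The file name, buffer size, and queue depth are placeholder assumptions. */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE  (4 << 20)   /* 4 MiB aggregated buffer */
#define ALIGNMENT 4096        /* typical block alignment for direct I/O */

int main(void)
{
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) {
        perror("io_uring_queue_init");
        return 1;
    }

    /* Toggle O_DIRECT to switch between the buffered and direct variants. */
    int fd = open("ckpt.bin", O_WRONLY | O_CREAT /* | O_DIRECT */, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Aligned allocation is required for O_DIRECT and helps the file system either way. */
    void *buf = NULL;
    if (posix_memalign(&buf, ALIGNMENT, BUF_SIZE) != 0) return 1;
    memset(buf, 0xAB, BUF_SIZE);  /* stand-in for coalesced tensor bytes */

    /* One submission queue entry covers the whole aggregated buffer. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, BUF_SIZE, 0 /* file offset */);
    io_uring_submit(&ring);

    /* Wait for completion and check the number of bytes written. */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("wrote %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    free(buf);
    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```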

Results & Findings

| Metric | POSIX Buffered | liburing Buffered | liburing Direct (no agg.) | liburing Aggregated | DataStates‑LLM | TorchSnapshot |
| --- | --- | --- | --- | --- | --- | --- |
| Peak write throughput (GB/s) | 1.2 | 1.8 | 1.0 | 4.7 | 1.2 | 0.6 |
| Metadata ops per checkpoint | 1.8 M | 1.5 M | 2.0 M | 0.4 M | 1.6 M | 1.4 M |
| Avg latency per tensor (µs) | 45 | 30 | 55 | 12 | 38 | 62 |
| Scaling (processes → 64) | 0.6× ideal | 0.8× ideal | 0.5× ideal | 0.95× ideal | 0.6× ideal | 0.4× ideal |

Key takeaways

  • Aggregation matters: grouping small tensors into 1–4 MiB aligned buffers recovers over 90% of the theoretical bandwidth of the underlying file system (a coalescing sketch follows this list).
  • Direct I/O alone isn’t enough: without aggregation, bypassing the page cache actually hurts performance because the storage backend sees many tiny, misaligned requests.
  • io_uring’s low‑overhead submission reduces CPU time spent in the kernel, freeing cores for compute‑heavy training loops.
  • The authors’ prototype outperforms the two state‑of‑the‑art checkpoint engines by a large margin, especially when the checkpoint size exceeds several hundred gigabytes.
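
As a rough illustration of this coalescing pattern (not the paper's implementation), the C sketch below packs many small, variable‑size tensor payloads into a 4 KiB‑aligned, 4 MiB staging buffer and flushes it as a single large request; the payload sizes and the flush stand‑in are assumptions.

```c
/* Sketch of the coalescing idea: pack many small tensor payloads into one
 * aligned staging buffer and hand it off as a single large write.
 * Sizes, tensor count, and the flush target are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define STAGE_SIZE (4 << 20)   /* 4 MiB staging buffer, per the takeaway above */
#define ALIGNMENT  4096

static void *stage;
static size_t used;

/* Stand-in for "submit one aligned buffer via io_uring or write()". */
static void flush_stage(void)
{
    if (used == 0) return;
    printf("flushing %zu bytes in one request\n", used);
    used = 0;
}

/* Append one tensor's bytes; flush when the staging buffer would overflow. */
static void stage_tensor(const void *data, size_t len)
{
    if (len > STAGE_SIZE) {          /* huge tensors bypass the staging path */
        printf("writing %zu bytes directly\n", len);
        return;
    }
    if (used + len > STAGE_SIZE)
        flush_stage();
    memcpy((char *)stage + used, data, len);
    used += len;
}

int main(void)
{
    if (posix_memalign(&stage, ALIGNMENT, STAGE_SIZE) != 0) return 1;

    /* Simulate a checkpoint made of many small, heterogeneous tensors. */
    char tensor[256 * 1024];
    memset(tensor, 0, sizeof tensor);
    for (int i = 0; i < 100; i++)
        stage_tensor(tensor, (size_t)(i % 4 + 1) * 64 * 1024);  /* 64-256 KiB payloads */
    flush_stage();

    free(stage);
    return 0;
}
```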

Practical Implications

  • Training pipelines can shave minutes (or even hours) off each checkpoint cycle, directly reducing total time‑to‑solution for models that require dozens of saves per epoch.
  • Infrastructure cost: Higher I/O efficiency means less pressure on parallel file systems, potentially allowing more jobs per cluster or smaller storage budgets.
  • Developer APIs – The paper’s aggregation pattern can be wrapped in a thin library (e.g., an extension layered under PyTorch’s torch.save or torch.distributed.checkpoint) that automatically buffers tensors before issuing io_uring writes, requiring minimal code changes.
  • Hybrid cloud / on‑prem setups: By aligning buffers to the block size of remote object stores (e.g., S3‑compatible APIs), the same technique can improve checkpoint uploads to cloud‑based artifact repositories.
  • Debugging & observability: The micro‑benchmark suite can be repurposed as a health‑check tool to verify that a new storage tier (NVMe, burst‑buffer, etc.) is delivering expected throughput before large‑scale runs.
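
On that last point, a storage‑tier health check could start from a sequential‑write probe like the C sketch below; the target path, buffer size, and write count are placeholder assumptions, and a real check would also vary concurrency, alignment, and buffered vs. direct I/O to mirror the benchmark matrix above.

```c
/* Minimal sequential-write throughput probe (illustrative only).
 * The target path, buffer size, and write count are placeholder assumptions. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BUF_SIZE  (4 << 20)   /* 4 MiB per write */
#define N_WRITES  256         /* 1 GiB total */

int main(void)
{
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, BUF_SIZE) != 0) return 1;
    memset(buf, 0x5A, BUF_SIZE);

    int fd = open("probe.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N_WRITES; i++) {
        if (write(fd, buf, BUF_SIZE) != BUF_SIZE) { perror("write"); return 1; }
    }
    fsync(fd);                 /* include the flush to stable storage */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gib  = (double)N_WRITES * BUF_SIZE / (1 << 30);
    printf("%.2f GiB in %.2f s -> %.2f GiB/s\n", gib, secs, gib / secs);

    free(buf);
    close(fd);
    return 0;
}
```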

Limitations & Future Work

  • GPU‑to‑CPU transfer cost is still a dominant factor; the study assumes pinned host memory but does not explore emerging GPUDirect Storage (GDS) pathways.
  • Experiments were limited to a single HPC cluster configuration; results may differ on heterogeneous cloud storage stacks or on systems with different network topologies.
  • The aggregation logic is currently static (fixed buffer size); adaptive schemes that react to runtime tensor size distributions could yield further gains.
  • The authors note that fault‑tolerance (e.g., partial‑write recovery) and security (encryption at rest) were outside the scope and merit dedicated investigation.

Bottom line: By treating LLM checkpointing as a high‑performance I/O problem and leveraging kernel‑level APIs with smart aggregation, developers can achieve order‑of‑magnitude speedups—making massive model training more practical and cost‑effective.

Authors

  • Mikaila J. Gossman
  • Avinash Maurya
  • Bogdan Nicolae
  • Jon C. Calhoun

Paper Information

  • arXiv ID: 2512.24511v1
  • Categories: cs.DC
  • Published: December 30, 2025