Building a File Copier 4x Faster Than cp Using io_uring
Introduction
I built a high‑performance file copier for machine‑learning datasets using Linux io_uring. On the right workload it is 4.2× faster than cp -r. Below are the lessons learned about when async I/O helps—and when it doesn’t.
Typical ML Dataset Sizes
| Dataset | Files | Typical Size |
|---|---|---|
| ImageNet | 1.28 M | 100–200 KB JPEG |
| COCO | 330 K | 50–500 KB |
| MNIST | 70 K | 784 bytes |
| CIFAR‑10 | 60 K | 3 KB |
Copying these with cp -r is painfully slow because cp issues several blocking syscalls per file (open, read, write, close), each one completing before the next starts. For 100 000 files that is 400 000+ syscalls executed strictly in sequence.
Why io_uring Helps
- Batched submission – queue dozens of operations and submit with a single syscall.
- Async completion – operations finish out of order, allowing the CPU to keep working.
- Zero‑copy – splice data directly between file descriptors via kernel pipes, avoiding userspace buffers.
Instead of:
open → read → write → close → repeat
we do:
submit 64 opens → process completions → submit reads/writes → batch everything
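A minimal sketch of what batched submission looks like with liburing (illustrative only; error handling is omitted, and it assumes the ring was initialized with at least count entries):
#include <fcntl.h>
#include <liburing.h>

// Queue a batch of openat requests, then submit the whole batch
// with a single syscall instead of one open(2) per file.
void submit_open_batch(struct io_uring* ring,
                       const char* const* paths, unsigned count) {
    for (unsigned i = 0; i < count; ++i) {
        struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
        io_uring_prep_openat(sqe, AT_FDCWD, paths[i], O_RDONLY, 0);
        io_uring_sqe_set_data64(sqe, i);  // tag so completions can be matched up
    }
    io_uring_submit(ring);  // one syscall submits all queued SQEs
}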
Architecture Overview
┌──────────────┐     ┌─────────────────┐     ┌─────────────────────┐
│ Main Thread  │────▶│    WorkQueue    │────▶│   Worker Threads    │
│  (scanner)   │     │  (thread‑safe)  │     │ (per‑thread uring)  │
└──────────────┘     └─────────────────┘     └─────────────────────┘
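The post doesn't show the WorkQueue itself; a mutex-and-condition-variable deque is one plausible shape for it:
#include <condition_variable>
#include <deque>
#include <mutex>
#include <optional>
#include <string>

// Scanner pushes paths; each worker pops and drives its own ring.
class WorkQueue {
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<std::string> items_;
    bool closed_ = false;
public:
    void push(std::string path) {
        { std::lock_guard<std::mutex> l(mu_); items_.push_back(std::move(path)); }
        cv_.notify_one();
    }
    void close() {  // scanner is done; wake all workers
        { std::lock_guard<std::mutex> l(mu_); closed_ = true; }
        cv_.notify_all();
    }
    std::optional<std::string> pop() {  // blocks; empty result = shutdown
        std::unique_lock<std::mutex> l(mu_);
        cv_.wait(l, [&] { return !items_.empty() || closed_; });
        if (items_.empty()) return std::nullopt;
        auto p = std::move(items_.front());
        items_.pop_front();
        return p;
    }
};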
Each file progresses through a state machine:
OPENING_SRC → STATING → OPENING_DST → SPLICE_IN ⇄ SPLICE_OUT → CLOSING_SRC → CLOSING_DST → DONE
Key Design Decisions
- 64 files in flight per worker at any time (setup sketched after this list).
- Per‑thread io_uring instances (avoids lock contention).
- Inode sorting for sequential disk access.
- Splice zero‑copy for data transfer (source → pipe → destination).
- Buffer pool with 4 KB‑aligned allocations (compatible with O_DIRECT).
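A sketch of the per-worker setup the first two decisions imply (names and sizes other than the 64-file depth and 4 KB alignment are illustrative):
#include <liburing.h>
#include <stdlib.h>
#include <vector>

constexpr unsigned kInFlight  = 64;      // files in flight per worker
constexpr size_t   kBufSize   = 1 << 16;
constexpr size_t   kAlignment = 4096;    // 4 KB, O_DIRECT-compatible

struct Worker {
    struct io_uring ring;                // per-thread ring: no lock contention
    std::vector<void*> buffer_pool;

    Worker() {
        // Double the depth leaves room for paired SQEs per file.
        io_uring_queue_init(kInFlight * 2, &ring, 0);
        for (unsigned i = 0; i < kInFlight; ++i) {
            void* buf = nullptr;
            posix_memalign(&buf, kAlignment, kBufSize);
            buffer_pool.push_back(buf);
        }
    }
};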
Benchmarks
NVMe (fast local storage)
| Workload | cp -r | uring‑sync | Speedup |
|---|---|---|---|
| 100 K × 4 KB files (400 MB) | 7.67 s | 5.14 s | 1.5× |
| 100 K × 100 KB files (10 GB) | 22.7 s | 5.4 s | 4.2× |
Cloud SSD (e.g., GCP Compute Engine)
| Workload | cp -r | uring‑sync | Speedup |
|---|---|---|---|
| 100 K × 4 KB files | 67.7 s | 31.5 s | 2.15× |
| 100 K × 100 KB files | 139.6 s | 64.7 s | 2.16× |
On fast storage, larger files benefit most: with dozens of transfers in flight, the drive stays saturated instead of idling between cp's sequential read/write calls. For 4 KB files, per-file metadata work (open, stat, create, close) dominates and caps the speedup.
File‑Copy State Machine (C++)
enum class FileState {
OPENING_SRC, // Opening source file
STATING, // Getting file size
OPENING_DST, // Creating destination
SPLICE_IN, // Reading into kernel pipe
SPLICE_OUT, // Writing from pipe to dest
CLOSING_SRC, // Closing source
CLOSING_DST, // Closing destination
DONE
};
Completions drive state transitions: when a completion arrives, the corresponding file context is looked up and its state is advanced.
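In code, that dispatch might look like the following sketch, assuming each SQE's user_data holds a pointer to a hypothetical FileContext with an advance() method:
#include <liburing.h>

struct FileContext {
    FileState state;        // enum from above
    void advance(int res);  // state-machine step, defined elsewhere
};

// Reap everything currently in the completion queue and advance
// each file's state machine; cqe->res is the syscall-style result.
void reap_completions(struct io_uring* ring) {
    struct io_uring_cqe* cqe;
    while (io_uring_peek_cqe(ring, &cqe) == 0) {
        auto* ctx = static_cast<FileContext*>(io_uring_cqe_get_data(cqe));
        ctx->advance(cqe->res);        // e.g. OPENING_SRC -> STATING on success
        io_uring_cqe_seen(ring, cqe);  // mark the CQE as consumed
    }
}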
Zero‑Copy with splice
// Splice from source into pipe
io_uring_prep_splice(sqe, src_fd, offset,
pipe_write_fd, -1,
chunk_size, 0);
// Splice from pipe to destination
io_uring_prep_splice(sqe, pipe_read_fd, -1,
dst_fd, offset,
chunk_size, 0);
Data never touches userspace; the kernel moves pages directly between file descriptors.
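The post doesn't show how the two splices are ordered; one way is a per-file pipe from pipe(2) plus IOSQE_IO_LINK, so the second splice runs only after the first succeeds:
#include <liburing.h>
#include <sys/types.h>
#include <unistd.h>

// Queue one linked splice pair: src -> pipe -> dst.
void queue_chunk(struct io_uring* ring, int src_fd, int dst_fd,
                 const int pipe_fds[2], off_t offset, unsigned chunk_size) {
    struct io_uring_sqe* in = io_uring_get_sqe(ring);
    io_uring_prep_splice(in, src_fd, offset, pipe_fds[1], -1, chunk_size, 0);
    io_uring_sqe_set_flags(in, IOSQE_IO_LINK);  // run next SQE only on success

    struct io_uring_sqe* out = io_uring_get_sqe(ring);
    io_uring_prep_splice(out, pipe_fds[0], -1, dst_fd, offset, chunk_size, 0);
}
A linked pair like this assumes the first splice moves the full chunk; a robust implementation has to detect short transfers and re-issue the remainder.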
Inode Sorting
Processing files in ascending inode order makes metadata and data accesses roughly sequential on disk:
std::sort(files.begin(), files.end(),
          [](const auto& a, const auto& b) { return a.inode < b.inode; });
Benchmark Environment
Benchmarks were run on Ubuntu 24.04 with kernel 6.14 on a local NVMe drive and on GCP Compute Engine VMs.