Building a File Copier 4x Faster Than cp Using io_uring
Introduction
I built a high‑performance file copier for machine‑learning datasets using Linux io_uring. On the right workload it is 4.2× faster than cp -r. Below are the lessons learned about when async I/O helps—and when it doesn’t.
Typical ML Dataset Sizes
| Dataset | Files | Typical Size |
|---|---|---|
| ImageNet | 1.28 M | 100–200 KB JPEG |
| COCO | 330 K | 50–500 KB |
| MNIST | 70 K | 784 bytes |
| CIFAR‑10 | 60 K | 3 KB |
Copying these with cp -r is painfully slow because cp issues several blocking syscalls per file (open, read, write, close), each one completing before the next starts. For 100 000 files that is 400 000+ syscalls executed strictly in sequence.
Why io_uring Helps
- Batched submission – queue dozens of operations and submit with a single syscall.
- Async completion – operations finish out of order, allowing the CPU to keep working.
- Zero‑copy – splice data directly between file descriptors via kernel pipes, avoiding userspace buffers.
Instead of:
open → read → write → close → repeat
we do:
submit 64 opens → process completions → submit reads/writes → batch everything
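A minimal sketch of what batched submission looks like with liburing (illustrative only; error handling is omitted, and it assumes the ring was initialized with at least count entries):
#include <fcntl.h>
#include <liburing.h>

// Queue a batch of openat requests, then submit the whole batch
// with a single syscall instead of one open(2) per file.
void submit_open_batch(struct io_uring* ring,
                       const char* const* paths, unsigned count) {
    for (unsigned i = 0; i < count; ++i) {
        struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
        io_uring_prep_openat(sqe, AT_FDCWD, paths[i], O_RDONLY, 0);
        io_uring_sqe_set_data64(sqe, i);  // tag so completions can be matched up
    }
    io_uring_submit(ring);  // one syscall submits all queued SQEs
}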
Architecture Overview
┌──────────────┐     ┌─────────────────┐     ┌─────────────────────┐
│ Main Thread  │────▶│    WorkQueue    │────▶│   Worker Threads    │
│  (scanner)   │     │  (thread‑safe)  │     │ (per‑thread uring)  │
└──────────────┘     └─────────────────┘     └─────────────────────┘
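The post doesn't show the WorkQueue itself; a mutex-and-condition-variable deque is one plausible shape for it:
#include <condition_variable>
#include <deque>
#include <mutex>
#include <optional>
#include <string>

// Scanner pushes paths; each worker pops and drives its own ring.
class WorkQueue {
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<std::string> items_;
    bool closed_ = false;
public:
    void push(std::string path) {
        { std::lock_guard<std::mutex> l(mu_); items_.push_back(std::move(path)); }
        cv_.notify_one();
    }
    void close() {  // scanner is done; wake all workers
        { std::lock_guard<std::mutex> l(mu_); closed_ = true; }
        cv_.notify_all();
    }
    std::optional<std::string> pop() {  // blocks; empty result = shutdown
        std::unique_lock<std::mutex> l(mu_);
        cv_.wait(l, [&] { return !items_.empty() || closed_; });
        if (items_.empty()) return std::nullopt;
        auto p = std::move(items_.front());
        items_.pop_front();
        return p;
    }
};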
Each file progresses through a state machine:
OPENING_SRC → STATING → OPENING_DST → SPLICE_IN ⇄ SPLICE_OUT → CLOSING_SRC → CLOSING_DST → DONE
Key Design Decisions
- 64 files in flight per worker at any time (setup sketched after this list).
- Per‑thread io_uring instances (avoids lock contention).
- Inode sorting for sequential disk access.
- Splice zero‑copy for data transfer (source → pipe → destination).
- Buffer pool with 4 KB‑aligned allocations (compatible with O_DIRECT).
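A sketch of the per-worker setup the first two decisions imply (names and sizes other than the 64-file depth and 4 KB alignment are illustrative):
#include <liburing.h>
#include <stdlib.h>
#include <vector>

constexpr unsigned kInFlight  = 64;      // files in flight per worker
constexpr size_t   kBufSize   = 1 << 16;
constexpr size_t   kAlignment = 4096;    // 4 KB, O_DIRECT-compatible

struct Worker {
    struct io_uring ring;                // per-thread ring: no lock contention
    std::vector<void*> buffer_pool;

    Worker() {
        // Double the depth leaves room for paired SQEs per file.
        io_uring_queue_init(kInFlight * 2, &ring, 0);
        for (unsigned i = 0; i < kInFlight; ++i) {
            void* buf = nullptr;
            posix_memalign(&buf, kAlignment, kBufSize);
            buffer_pool.push_back(buf);
        }
    }
};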
Benchmarks
NVMe (fast local storage)
| Workload | cp -r | uring‑sync | Speedup |
|---|---|---|---|
| 100 K × 4 KB files (400 MB) | 7.67 s | 5.14 s | 1.5× |
| 100 K × 100 KB files (10 GB) | 22.7 s | 5.4 s | 4.2× |
Cloud SSD (e.g., GCP Compute Engine)
| Workload | cp -r | uring‑sync | Speedup |
|---|---|---|---|
| 100 K × 4 KB files | 67.7 s | 31.5 s | 2.15× |
| 100 K × 100 KB files | 139.6 s | 64.7 s | 2.16× |
On fast storage, larger files benefit most: with dozens of transfers in flight, the drive stays saturated instead of idling between cp's sequential read/write calls. For 4 KB files, per-file metadata work (open, stat, create, close) dominates and caps the speedup.
File‑Copy State Machine (C++)
enum class FileState {
OPENING_SRC, // Opening source file
STATING, // Getting file size
OPENING_DST, // Creating destination
SPLICE_IN, // Reading into kernel pipe
SPLICE_OUT, // Writing from pipe to dest
CLOSING_SRC, // Closing source
CLOSING_DST, // Closing destination
DONE
};
Completions drive state transitions: when a completion arrives, the corresponding file context is looked up and its state is advanced.
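In code, that dispatch might look like the following sketch, assuming each SQE's user_data holds a pointer to a hypothetical FileContext with an advance() method:
#include <liburing.h>

struct FileContext {
    FileState state;        // enum from above
    void advance(int res);  // state-machine step, defined elsewhere
};

// Reap everything currently in the completion queue and advance
// each file's state machine; cqe->res is the syscall-style result.
void reap_completions(struct io_uring* ring) {
    struct io_uring_cqe* cqe;
    while (io_uring_peek_cqe(ring, &cqe) == 0) {
        auto* ctx = static_cast<FileContext*>(io_uring_cqe_get_data(cqe));
        ctx->advance(cqe->res);        // e.g. OPENING_SRC -> STATING on success
        io_uring_cqe_seen(ring, cqe);  // mark the CQE as consumed
    }
}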
Zero‑Copy with splice
// Splice from source into pipe
io_uring_prep_splice(sqe, src_fd, offset,
pipe_write_fd, -1,
chunk_size, 0);
// Splice from pipe to destination
io_uring_prep_splice(sqe, pipe_read_fd, -1,
dst_fd, offset,
chunk_size, 0);
Data never touches userspace; the kernel moves pages directly between file descriptors.
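The post doesn't show how the two splices are ordered; one way is a per-file pipe from pipe(2) plus IOSQE_IO_LINK, so the second splice runs only after the first succeeds:
#include <liburing.h>
#include <sys/types.h>
#include <unistd.h>

// Queue one linked splice pair: src -> pipe -> dst.
void queue_chunk(struct io_uring* ring, int src_fd, int dst_fd,
                 const int pipe_fds[2], off_t offset, unsigned chunk_size) {
    struct io_uring_sqe* in = io_uring_get_sqe(ring);
    io_uring_prep_splice(in, src_fd, offset, pipe_fds[1], -1, chunk_size, 0);
    io_uring_sqe_set_flags(in, IOSQE_IO_LINK);  // run next SQE only on success

    struct io_uring_sqe* out = io_uring_get_sqe(ring);
    io_uring_prep_splice(out, pipe_fds[0], -1, dst_fd, offset, chunk_size, 0);
}
A linked pair like this assumes the first splice moves the full chunk; a robust implementation has to detect short transfers and re-issue the remainder.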
Inode Sorting
Processing files in ascending inode order makes metadata and data accesses roughly sequential on disk:
std::sort(files.begin(), files.end(),
          [](const auto& a, const auto& b) { return a.inode < b.inode; });
Benchmark Environment
Benchmarks were run on Ubuntu 24.04 with kernel 6.14 on a local NVMe drive and on GCP Compute Engine VMs.