Building a File Copier 4x Faster Than cp Using io_uring

Published: January 7, 2026 at 12:45 PM EST
2 min read
Source: Dev.to

Introduction

I built a high‑performance file copier for machine‑learning datasets using Linux io_uring. On the right workload it is 4.2× faster than cp -r. Below are the lessons learned about when async I/O helps—and when it doesn’t.

Typical ML Dataset Sizes

| Dataset  | Files  | Typical Size    |
|----------|--------|-----------------|
| ImageNet | 1.28 M | 100–200 KB JPEG |
| COCO     | 330 K  | 50–500 KB       |
| MNIST    | 70 K   | 784 bytes       |
| CIFAR‑10 | 60 K   | 3 KB            |

Copying these with cp -r is painfully slow because each file requires several syscalls (open, read, write, close), issued one at a time. For 100 000 files that is 400 000+ sequential syscalls, each with its own user–kernel transition.

Why io_uring Helps

  • Batched submission – queue dozens of operations and submit with a single syscall.
  • Async completion – operations finish out of order, allowing the CPU to keep working.
  • Zero‑copy – splice data directly between file descriptors via kernel pipes, avoiding userspace buffers.

Instead of:

open → read → write → close → repeat

we do:

submit 64 opens → process completions → submit reads/writes → batch everything

Architecture Overview

┌──────────────┐     ┌───────────────┐     ┌─────────────────────┐
│ Main Thread  │────▶│  WorkQueue    │────▶│  Worker Threads     │
│ (scanner)    │     │ (thread-safe) │     │ (per-thread uring)  │
└──────────────┘     └───────────────┘     └─────────────────────┘
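The WorkQueue in the middle can be as simple as a mutex-plus-condition-variable queue. The article doesn't show its internals, so the sketch below uses illustrative names; the scanner thread pushes paths, workers pop until the queue is closed and drained.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>

// Minimal thread-safe work queue (a sketch, not the article's actual code).
template <typename T>
class WorkQueue {
public:
    void push(T item) {
        { std::lock_guard<std::mutex> lk(mu_); q_.push(std::move(item)); }
        cv_.notify_one();
    }
    // Scanner calls close() when directory traversal is finished.
    void close() {
        { std::lock_guard<std::mutex> lk(mu_); closed_ = true; }
        cv_.notify_all();
    }
    // Blocks until an item is available; returns nullopt once closed and empty.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T item = std::move(q_.front());
        q_.pop();
        return item;
    }
private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<T> q_;
    bool closed_ = false;
};
```

Each worker loops on `pop()`, feeding its private io_uring instance, so the only shared state is this queue.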

Each file progresses through a state machine:

OPENING_SRC → STATING → OPENING_DST → SPLICE_IN ⇄ SPLICE_OUT → CLOSING

Key Design Decisions

  • 64 files in‑flight per worker simultaneously.
  • Per‑thread io_uring instances (avoids lock contention).
  • Inode sorting for sequential disk access.
  • Splice zero‑copy for data transfer (source → pipe → destination).
  • Buffer pool with 4 KB‑aligned allocations (compatible with O_DIRECT).
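The last bullet, a pool of 4 KB-aligned buffers, can be sketched with C++17's `std::aligned_alloc` (which requires the size to be a multiple of the alignment). The class name and interface here are illustrative, not the article's actual implementation.

```cpp
#include <cstdlib>
#include <vector>

// Pool of page-aligned buffers, suitable for O_DIRECT I/O (sketch).
class BufferPool {
public:
    // size must be a multiple of 4096 for std::aligned_alloc.
    BufferPool(size_t count, size_t size) {
        for (size_t i = 0; i < count; i++)
            free_.push_back(std::aligned_alloc(4096, size));
    }
    ~BufferPool() {
        for (void* p : free_) std::free(p);
    }
    // Returns nullptr when the pool is exhausted (caller backs off).
    void* acquire() {
        if (free_.empty()) return nullptr;
        void* p = free_.back();
        free_.pop_back();
        return p;
    }
    void release(void* p) { free_.push_back(p); }
private:
    std::vector<void*> free_;
};
```

Pre-allocating a fixed pool also bounds memory use: with 64 files in flight per worker, the pool size caps how much buffer memory a worker can ever hold.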

Benchmarks

NVMe (fast local storage)

| Workload                     | cp -r  | uring-sync | Speedup |
|------------------------------|--------|------------|---------|
| 100 K × 4 KB files (400 MB)  | 7.67 s | 5.14 s     | 1.5×    |
| 100 K × 100 KB files (10 GB) | 22.7 s | 5.4 s      | 4.2×    |

Cloud SSD (e.g., GCP Compute Engine)

| Workload             | cp -r   | uring-sync | Speedup |
|----------------------|---------|------------|---------|
| 100 K × 4 KB files   | 67.7 s  | 31.5 s     | 2.15×   |
| 100 K × 100 KB files | 139.6 s | 64.7 s     | 2.16×   |

Larger files see the biggest win on fast storage: with 100 KB files the splice data path dominates and overlapped I/O keeps the NVMe queues full, whereas with 4 KB files per-file metadata operations (open, stat, close) dominate and limit the achievable speedup.

File‑Copy State Machine (C++)

enum class FileState {
    OPENING_SRC,    // Opening source file
    STATING,        // Getting file size
    OPENING_DST,    // Creating destination
    SPLICE_IN,      // Reading into kernel pipe
    SPLICE_OUT,     // Writing from pipe to dest
    CLOSING_SRC,    // Closing source
    CLOSING_DST,    // Closing destination
    DONE
};

Completions drive state transitions: when a completion arrives, the corresponding file context is looked up and its state is advanced.
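A transition function for this state machine might look like the sketch below, using the enum from above. It only advances state; in the real copier each transition would also submit the next SQE. The `FileCtx` struct and `advance` function are illustrative names, and the CLOSING_SRC/CLOSING_DST states are folded in as shown in the enum.

```cpp
enum class FileState { OPENING_SRC, STATING, OPENING_DST, SPLICE_IN,
                       SPLICE_OUT, CLOSING_SRC, CLOSING_DST, DONE };

struct FileCtx {
    FileState state = FileState::OPENING_SRC;
    long long remaining = 0;   // bytes still to copy
};

// Advance one file's state machine given the result (cqe->res) of the
// operation that just completed for it.
FileState advance(FileCtx& ctx, long long cqe_res) {
    switch (ctx.state) {
        case FileState::OPENING_SRC: ctx.state = FileState::STATING; break;
        case FileState::STATING:
            ctx.remaining = cqe_res;           // stat result: file size
            ctx.state = FileState::OPENING_DST; break;
        case FileState::OPENING_DST: ctx.state = FileState::SPLICE_IN; break;
        case FileState::SPLICE_IN:  ctx.state = FileState::SPLICE_OUT; break;
        case FileState::SPLICE_OUT:
            ctx.remaining -= cqe_res;          // bytes drained from the pipe
            ctx.state = ctx.remaining > 0 ? FileState::SPLICE_IN
                                          : FileState::CLOSING_SRC;
            break;
        case FileState::CLOSING_SRC: ctx.state = FileState::CLOSING_DST; break;
        case FileState::CLOSING_DST: ctx.state = FileState::DONE; break;
        case FileState::DONE: break;
    }
    return ctx.state;
}
```

The SPLICE_IN ⇄ SPLICE_OUT loop repeats until `remaining` hits zero, which is what makes a single 64-deep ring able to interleave many multi-chunk copies.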

Zero‑Copy with splice

// Splice from source into pipe
io_uring_prep_splice(sqe, src_fd, offset,
                     pipe_write_fd, -1,
                     chunk_size, 0);

// Splice from pipe to destination
io_uring_prep_splice(sqe, pipe_read_fd, -1,
                     dst_fd, offset,
                     chunk_size, 0);

Data never touches userspace; the kernel moves pages directly between file descriptors.

Inode Sorting

std::sort(files.begin(), files.end(),
    [](const auto& a, const auto& b) { return a.inode < b.inode; });
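Putting that together, the inode numbers come from stat(2) during the scan; sorting on `st_ino` tends to visit files in roughly on-disk order, which helps sequential access on spinning disks and even some SSD layouts. The `FileEntry` struct and `sort_by_inode` name below are illustrative.

```cpp
#include <algorithm>
#include <string>
#include <sys/stat.h>
#include <vector>

struct FileEntry {
    std::string path;
    ino_t inode;
};

// Collect inode numbers with stat(2) and order entries by inode
// so the copier processes files in approximate on-disk order.
std::vector<FileEntry> sort_by_inode(const std::vector<std::string>& paths) {
    std::vector<FileEntry> files;
    for (const auto& p : paths) {
        struct stat st {};
        if (stat(p.c_str(), &st) == 0)
            files.push_back({p, st.st_ino});
    }
    std::sort(files.begin(), files.end(),
        [](const FileEntry& a, const FileEntry& b) { return a.inode < b.inode; });
    return files;
}
```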

Benchmarks were run on Ubuntu 24.04 with kernel 6.14 on a local NVMe drive and on GCP Compute Engine VMs.
