Building a Fast File Transfer Tool, Part 2: Beating rsync by 58% with kTLS

Published: January 7, 2026 at 04:19 PM EST
4 min read
Source: Dev.to

The Problem: SSH Is the Bottleneck

When transferring ML datasets between machines, rsync over SSH is the go‑to tool:

rsync -az /data/ml_dataset user@server:/backup/

It works, but it’s slow. For a 9.7 GiB dataset (≈ 100 K files), rsync took 390 s (≈ 25 MiB s⁻¹).

The bottleneck isn’t the network; it’s userspace encryption.

┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐
│  File   │────▶│ rsync   │────▶│  SSH    │────▶│ Network │
│  Read   │     │ (delta) │     │ encrypt │     │  Send   │
└─────────┘     └─────────┘     └─────────┘     └─────────┘

                     Context switches,
                     userspace copies,
                     CPU‑bound AES

Every byte passes through the SSH process, which encrypts it with OpenSSL in userspace. This involves:

  • Multiple context switches between kernel and userspace
  • Copying data between kernel buffers and userspace buffers
  • CPU time for AES (even with AES‑NI)

The Solution: kTLS (Kernel TLS)

Linux 4.13+ supports kTLS – TLS encryption handled directly in the kernel. Once the TLS session is set up, the kernel encrypts data as it flows through the socket.

┌─────────┐     ┌─────────┐     ┌──────────────────┐
│  File   │────▶│  read   │────▶│ Socket (kTLS)    │
│         │     │         │     │ encrypt + send   │
└─────────┘     └─────────┘     └──────────────────┘

                               One kernel operation,
                               no extra userspace copies,
                               AES‑NI in kernel

Benefits

  • No userspace encryption process – the kernel handles it directly
  • Fewer copies – data is not piped through a separate userspace encryption process
  • AES‑NI in kernel – hardware acceleration without extra context switches

Implementation

Setting up kTLS requires:

  1. TLS handshake – exchange keys (we use a pre‑shared secret + HKDF)
  2. Configure kernel – attach the TLS ULP (TCP_ULP), then setsockopt(SOL_TLS, TLS_TX, …) with cipher keys
  3. Send data – regular send() calls; the kernel encrypts automatically
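
#include <linux/tls.h>     /* tls12_crypto_info_aes_gcm_128, TLS_TX */
#include <netinet/tcp.h>   /* SOL_TCP, TCP_ULP */
#include <string.h>
#include <sys/socket.h>

/* The TLS ULP must be attached before SOL_TLS options are accepted. */
setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));

/* After deriving keys from the shared secret … */
struct tls12_crypto_info_aes_gcm_128 crypto_info = {
    .info.version      = TLS_1_2_VERSION,
    .info.cipher_type  = TLS_CIPHER_AES_GCM_128,
};
/* rec_seq stays zero-initialized: the first record has sequence number 0. */
memcpy(crypto_info.key,  key, 16);
memcpy(crypto_info.iv,   iv,  8);
memcpy(crypto_info.salt, salt, 4);

setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info, sizeof(crypto_info));
/* Once this succeeds, all send() calls are automatically encrypted! */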
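
A slightly fuller sketch of the kernel configuration step, with error handling. It assumes the TCP connection is already established and that key, iv, and salt came out of the HKDF step. The names ktls_fill_info and ktls_enable and the fallback #defines are illustrative, not taken from uring‑sync.

```c
#include <errno.h>
#include <linux/tls.h>     /* tls12_crypto_info_aes_gcm_128, TLS_TX, TLS_RX */
#include <netinet/in.h>    /* IPPROTO_TCP */
#include <netinet/tcp.h>   /* TCP_ULP on recent glibc */
#include <string.h>
#include <sys/socket.h>

/* Fallbacks for older libc headers; values match the Linux UAPI. */
#ifndef TCP_ULP
#define TCP_ULP 31
#endif
#ifndef SOL_TLS
#define SOL_TLS 282
#endif

/* Fill the AES-128-GCM parameter block the kernel expects.
 * rec_seq stays zero: a fresh session starts at record number 0. */
void ktls_fill_info(struct tls12_crypto_info_aes_gcm_128 *ci,
                    const unsigned char key[16],
                    const unsigned char iv[8],
                    const unsigned char salt[4])
{
    memset(ci, 0, sizeof(*ci));
    ci->info.version     = TLS_1_2_VERSION;
    ci->info.cipher_type = TLS_CIPHER_AES_GCM_128;
    memcpy(ci->key,  key,  TLS_CIPHER_AES_GCM_128_KEY_SIZE);
    memcpy(ci->iv,   iv,   TLS_CIPHER_AES_GCM_128_IV_SIZE);
    memcpy(ci->salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
}

/* Enable kTLS on a connected TCP socket.
 * direction is TLS_TX (sender) or TLS_RX (receiver).
 * Returns 0 on success, -1 on failure with errno set by setsockopt(). */
int ktls_enable(int sock, int direction,
                const unsigned char key[16],
                const unsigned char iv[8],
                const unsigned char salt[4])
{
    struct tls12_crypto_info_aes_gcm_128 ci;

    /* Step 1: attach the "tls" upper-layer protocol to the socket.
     * EEXIST means it is already attached (e.g. the other direction
     * was configured first), which is fine. */
    if (setsockopt(sock, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0 &&
        errno != EEXIST)
        return -1;

    /* Step 2: hand the session keys to the kernel. */
    ktls_fill_info(&ci, key, iv, salt);
    if (setsockopt(sock, SOL_TLS, direction, &ci, sizeof(ci)) < 0)
        return -1;

    return 0;
}
```

The sender would call ktls_enable(sock, TLS_TX, …). A receiver that wants the kernel to decrypt would use TLS_RX, which needs Linux 4.17 or newer. If either call fails (for example because the tls module is not loaded), the caller has to abort or fall back to another encryption path.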

Benchmark Results

Test environment: Laptop → GCP VM (public Internet)

Headline Numbers

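Dataset                          | uring‑sync + kTLS | rsync (SSH) | Improvement
---------------------------------|-------------------|-------------|------------
ml_small (60 MiB, 10 K files)    | 2.98 s            | 2.63 s      | ~equal
ml_large (589 MiB, 100 K files)  | 16.4 s            | 24.8 s      | 34 % faster
ml_images (9.7 GiB, 100 K files) | 165 s             | 390 s       | 58 % faster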

The Pattern

Data size:    60 MiB → 589 MiB → 9.7 GiB
Improvement:   0 %   →   34 %   →    58 %

The larger the transfer, the bigger the kTLS advantage.
Reason: per‑connection overhead (handshake, key derivation) is amortized over more data, while SSH’s userspace encryption overhead grows linearly with data size.

Throughput Comparison

Method            | Throughput  | CPU Usage
------------------|-------------|----------------------------
rsync (SSH)       | 25 MiB s⁻¹  | High (userspace encryption)
uring‑sync + kTLS | 60 MiB s⁻¹  | Low (kernel encryption)

kTLS achieves 2.4× the throughput of rsync while using less CPU.
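
To make the relationship between "58 % faster" and "2.4×" explicit, here is a small sketch of the arithmetic behind the headline numbers. The function names are made up for illustration; the inputs are the measured times from the table above.

```c
/* Percentage of wall-clock time saved relative to the baseline.
 * A negative result means the new tool was slower. */
double time_saved_percent(double baseline_s, double new_s)
{
    return (baseline_s - new_s) / baseline_s * 100.0;
}

/* How many times faster the new tool is (ratio of throughputs,
 * which for the same dataset equals the inverse ratio of times). */
double speedup(double baseline_s, double new_s)
{
    return baseline_s / new_s;
}

/* Average throughput in MiB/s for a transfer of size_mib mebibytes. */
double throughput_mib_s(double size_mib, double seconds)
{
    return size_mib / seconds;
}
```

For ml_images, time_saved_percent(390, 165) is about 57.7 and speedup(390, 165) is about 2.36, so "58 % faster" here means 58 % less wall-clock time, which is the same result as roughly 2.4× the throughput. For ml_small the same formula gives about −13 %, meaning rsync was slightly ahead on the smallest dataset.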

Why Not Zero‑Copy splice()?

In theory, kTLS supports splice() for true zero‑copy transfers:

File → Pipe → kTLS Socket   (no userspace copies!)

I implemented this expecting it to be the fastest path, but it turned out 2.9× slower than the plaintext read/send baseline (and about 2.6× slower than kTLS with read/send).

Investigation

strace revealed the bottleneck:

splice(file → pipe):   27 µs   ← instant
splice(pipe → socket): 33 ms   ← 1000× slower!

splice(pipe → kTLS socket) blocks while waiting for TCP ACKs. The kernel cannot buffer aggressively as it does with regular send() calls.

Lesson

Zero‑copy isn’t always faster. For many‑file workloads:

  • read / send – kernel manages buffering efficiently
  • splice – blocks on each chunk, killing throughput

splice() may help for a single huge file, but for ML datasets (many small files) stick with read() + send().
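
The read() + send() path recommended above can be sketched as follows. This is a minimal illustration, not uring‑sync's actual code: the function names and the 64 KiB chunk size are assumptions. On a socket with kTLS enabled, the same send() calls would be encrypted by the kernel.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send exactly len bytes, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error. */
int send_all(int sock, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = send(sock, buf, len, MSG_NOSIGNAL);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Copy everything readable from file_fd to the socket through a
 * userspace buffer. Returns the number of bytes sent, or -1 on error. */
long long send_file_read_send(int file_fd, int sock)
{
    char buf[64 * 1024];   /* chunk size is an arbitrary choice */
    long long total = 0;

    for (;;) {
        ssize_t n = read(file_fd, buf, sizeof(buf));
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)        /* end of file */
            return total;
        if (send_all(sock, buf, (size_t)n) < 0)
            return -1;
        total += n;
    }
}
```

Each chunk is copied once into userspace and once back into the kernel, but send() returns as soon as the data sits in the socket buffer, so the loop can move on to the next chunk or the next file while earlier data is still in flight.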

When to Use This

Use kTLS file transfer when:

  • Transferring large datasets (> 500 MiB)
  • The network has bandwidth to spare
  • You control both endpoints
  • Security is required (not just a VPN)

Stick with rsync when:

  • You need delta sync (only changed bytes)
  • Destination already has partial data
  • Existing SSH infrastructure is sufficient
  • Simplicity matters more than raw speed

The Wire Protocol

Our protocol is intentionally minimal:

HELLO (secret hash) ──────────────────▶ Verify
                    ◀────────────────── HELLO_OK (+ enable kTLS)

FILE_HDR (path, size, mode) ──────────▶ Create file
FILE_DATA (chunks) ───────────────────▶ Write data
FILE_END ──────────────────────────────▶ Close file

(repeat for all files)

ALL_DONE ──────────────────────────────▶ Complete
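
The post does not spell out the on‑wire encoding, so the following is only one plausible framing for these messages, not the format uring‑sync actually uses. It assumes each message starts with a 5‑byte header (1 byte type, 4 byte big‑endian payload length). The enum values, FRAME_HDR_LEN, and both function names are hypothetical.

```c
#include <stdint.h>

/* Message types mirroring the names in the diagram above.
 * The numeric values are made up for this sketch. */
enum msg_type {
    MT_HELLO = 1,
    MT_HELLO_OK,
    MT_FILE_HDR,
    MT_FILE_DATA,
    MT_FILE_END,
    MT_ALL_DONE
};

#define FRAME_HDR_LEN 5   /* 1 byte type + 4 byte payload length */

/* Write a frame header into out[0..4]; the length is big-endian. */
void frame_hdr_encode(uint8_t out[FRAME_HDR_LEN],
                      enum msg_type type, uint32_t payload_len)
{
    out[0] = (uint8_t)type;
    out[1] = (uint8_t)(payload_len >> 24);
    out[2] = (uint8_t)(payload_len >> 16);
    out[3] = (uint8_t)(payload_len >> 8);
    out[4] = (uint8_t)payload_len;
}

/* Parse a frame header.
 * Returns 0 on success, -1 if the type byte is not a known message. */
int frame_hdr_decode(const uint8_t in[FRAME_HDR_LEN],
                     enum msg_type *type, uint32_t *payload_len)
{
    if (in[0] < MT_HELLO || in[0] > MT_ALL_DONE)
        return -1;
    *type = (enum msg_type)in[0];
    *payload_len = ((uint32_t)in[1] << 24) | ((uint32_t)in[2] << 16) |
                   ((uint32_t)in[3] << 8)  |  (uint32_t)in[4];
    return 0;
}
```

With a length‑prefixed header like this, the receiver always knows how many bytes to read next, which keeps the per‑file overhead to a handful of bytes for FILE_HDR, FILE_DATA, and FILE_END.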

End of Part 2.

uring‑sync: Fast Encrypted File Transfer with kTLS

No delta encoding, no checksums – kTLS provides integrity via GCM.
Just raw file transfer with authentication and encryption.

Usage

# Receiver (on remote host)
uring-sync recv /backup --listen 9999 --secret mykey --tls

# Sender (on local host)
uring-sync send /data remote-host:9999 --secret mykey --tls

Implementation Details

  • Key derivation: HKDF from the shared secret
  • Encryption: AES‑128‑GCM via kTLS
  • Transport: Simple TCP protocol (no HTTP, no gRPC)

Full source:

Conclusion

By moving encryption from userspace SSH to kernel kTLS, we achieved:

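Metric            | uring‑sync            | rsync
------------------|-----------------------|---------------------------
Speed improvement | 58 % faster           | baseline
Throughput        | 60 MB/s (2.4×)        | 25 MB/s
CPU usage         | Lower (kernel AES‑NI) | Higher (userspace OpenSSL)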

Key insight: For bulk data transfer, SSH’s flexibility adds overhead. A purpose‑built tool with kernel encryption wins.

Appendix: Full Benchmark Data

Test Environment

  • Sender: Ubuntu laptop, local NVMe
  • Receiver: GCP VM (us‑central1‑a)
  • Network: Public internet
  • Cache: Cold cache (echo 3 > /proc/sys/vm/drop_caches)

Raw Results

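Dataset   | Files | Size   | kTLS Time | kTLS Speed | rsync Time | rsync Speed
----------|-------|--------|-----------|------------|------------|------------
ml_small  | 10 K  | 60 MB  | 2.98 s    | 20 MB/s    | 2.63 s     | 23 MB/s
ml_large  | 100 K | 589 MB | 16.4 s    | 36 MB/s    | 24.8 s     | 24 MB/s
ml_images | 100 K | 9.7 GB | 165 s     | 60 MB/s    | 390 s      | 25 MB/s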

Splice Investigation (ml_images)

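Mode                  | Time  | Speed   | Notes
----------------------|-------|---------|------------------------
Plaintext + read/send | 146 s | 68 MB/s | Fastest (no encryption)
Plaintext + splice    | 157 s | 63 MB/s | +8 % overhead
kTLS + read/send      | 165 s | 60 MB/s | +13 % (encryption cost)
kTLS + splice         | 428 s | 23 MB/s | 2.9× slower (broken)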

Benchmarks run January 2026. Your mileage may vary depending on network conditions and hardware.

Tags: #linux #ktls #tls #rsync #performance #networking #encryption
