Building a Fast File Transfer Tool, Part 2: Beating rsync by 58% with kTLS
Source: Dev.to
The Problem: SSH Is the Bottleneck
When transferring ML datasets between machines, rsync over SSH is the go‑to tool:
rsync -az /data/ml_dataset user@server:/backup/
It works, but it’s slow. For a 9.7 GB dataset (≈ 100 K files), rsync took 390 s – about 25 MB/s.
The bottleneck isn’t the network; it’s userspace encryption.
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ File │────▶│ rsync │────▶│ SSH │────▶│ Network │
│ Read │ │ (delta) │ │ encrypt │ │ Send │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
│
Context switches,
userspace copies,
CPU‑bound AES
Every byte passes through the SSH process, which encrypts it with OpenSSL in userspace. This involves:
- Multiple context switches between kernel and userspace
- Copying data between kernel buffers and userspace buffers
- CPU time for AES (even with AES‑NI)
The Solution: kTLS (Kernel TLS)
Linux 4.13+ supports kTLS – TLS encryption handled directly in the kernel. Once the TLS session is set up, the kernel encrypts data as it flows through the socket.
┌─────────┐ ┌─────────┐ ┌──────────────────┐
│ File │────▶│ read │────▶│ Socket (kTLS) │
│ │ │ │ │ encrypt + send │
└─────────┘ └─────────┘ └──────────────────┘
│
One kernel operation,
no userspace copies,
AES‑NI in kernel
Benefits
- No userspace encryption process – the kernel handles it directly
- Fewer copies – data never bounces through userspace
- AES‑NI in kernel – hardware acceleration without extra context switches
Implementation
Setting up kTLS requires three steps:
- TLS handshake – exchange keys (we use a pre‑shared secret + HKDF)
- Configure the kernel – setsockopt(SOL_TLS, TLS_TX, …) with the cipher keys
- Send data – regular send() calls; the kernel encrypts automatically
/* After deriving keys from the shared secret … */

/* Step 1: attach the kernel TLS ULP to the connected TCP socket */
setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));

/* Step 2: hand the kernel the AES-128-GCM session state */
struct tls12_crypto_info_aes_gcm_128 crypto_info = {
    .info.version = TLS_1_2_VERSION,
    .info.cipher_type = TLS_CIPHER_AES_GCM_128,
};
memcpy(crypto_info.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);       /* 16 bytes */
memcpy(crypto_info.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);          /*  8 bytes */
memcpy(crypto_info.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);    /*  4 bytes */
memcpy(crypto_info.rec_seq, rec_seq,                                 /*  8 bytes: */
       TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);                         /*  initial record sequence number */
setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info, sizeof(crypto_info));

/* Now all send() calls are automatically encrypted! */
Benchmark Results
Test environment: Laptop → GCP VM (public Internet)
Headline Numbers
| Dataset | uring‑sync + kTLS | rsync (SSH) | Improvement |
|---|---|---|---|
| ml_small (60 MiB, 10 K files) | 2.98 s | 2.63 s | ~equal (rsync slightly faster) |
| ml_large (589 MiB, 100 K files) | 16.4 s | 24.8 s | 34 % faster |
| ml_images (9.7 GiB, 100 K files) | 165 s | 390 s | 58 % faster |
The Pattern
Data size: 60 MiB → 589 MiB → 9.7 GiB
Improvement: 0 % → 34 % → 58 %
The larger the transfer, the bigger the kTLS advantage.
Reason: per‑connection overhead (handshake, key derivation) is fixed and gets amortized over more data, while the per‑byte savings of kernel encryption over SSH’s userspace encryption scale with transfer size.
Throughput Comparison
| Method | Throughput | CPU Usage |
|---|---|---|
| rsync (SSH) | 25 MB/s | High (userspace encryption) |
| uring‑sync + kTLS | 60 MB/s | Low (kernel encryption) |
kTLS achieves 2.4× the throughput of rsync while using less CPU.
Why Not Zero‑Copy splice()?
In theory, kTLS supports splice() for true zero‑copy transfers:
File → Pipe → kTLS Socket (no userspace copies!)
I implemented this expecting it to be the fastest path, but it turned out 2.9× slower.
Investigation
strace revealed the bottleneck:
splice(file → pipe): 27 µs ← instant
splice(pipe → socket): 33 ms ← 1000× slower!
splice(pipe → kTLS socket) blocks while waiting for TCP ACKs. The kernel cannot buffer aggressively as it does with regular send() calls.
Lesson
Zero‑copy isn’t always faster. For many‑file workloads:
- read / send – kernel manages buffering efficiently
- splice – blocks on each chunk, killing throughput
splice() may help for a single huge file, but for ML datasets (many small files) stick with read() + send().
When to Use This
Use kTLS file transfer when:
- Transferring large datasets (> 500 MiB)
- The network has bandwidth to spare
- You control both endpoints
- Security is required (not just a VPN)
Stick with rsync when:
- You need delta sync (only changed bytes)
- Destination already has partial data
- Existing SSH infrastructure is sufficient
- Simplicity matters more than raw speed
The Wire Protocol
Our protocol is intentionally minimal:
HELLO (secret hash) ──────────────────▶ Verify
◀────────────────── HELLO_OK (+ enable kTLS)
FILE_HDR (path, size, mode) ──────────▶ Create file
FILE_DATA (chunks) ───────────────────▶ Write data
FILE_END ──────────────────────────────▶ Close file
(repeat for all files)
ALL_DONE ──────────────────────────────▶ Complete
End of Part 2.
uring‑sync: Fast Encrypted File Transfer with kTLS
No delta encoding, no checksums – kTLS provides integrity via GCM.
Just raw file transfer with authentication and encryption.
Usage
# Receiver (on remote host)
uring-sync recv /backup --listen 9999 --secret mykey --tls
# Sender (on local host)
uring-sync send /data remote-host:9999 --secret mykey --tls
Implementation Details
- Key derivation: HKDF from the shared secret
- Encryption: AES‑128‑GCM via kTLS
- Transport: Simple TCP protocol (no HTTP, no gRPC)
Full source:
Conclusion
By moving encryption from userspace SSH to kernel kTLS, we achieved:
| Metric | uring‑sync | rsync |
|---|---|---|
| Speed improvement | 58 % faster | – |
| Throughput | 60 MB/s (2.4×) | 25 MB/s |
| CPU usage | Lower (kernel AES‑NI) | Higher (userspace OpenSSL) |
Key insight: For bulk data transfer, SSH’s flexibility adds overhead. A purpose‑built tool with kernel encryption wins.
Appendix: Full Benchmark Data
Test Environment
- Sender: Ubuntu laptop, local NVMe
- Receiver: GCP VM (us‑central1‑a)
- Network: Public internet
- Cache: Cold cache (echo 3 > /proc/sys/vm/drop_caches)
Raw Results
| Dataset | Files | Size | kTLS Time | kTLS Speed | rsync Time | rsync Speed |
|---|---|---|---|---|---|---|
| ml_small | 10 K | 60 MB | 2.98 s | 20 MB/s | 2.63 s | 23 MB/s |
| ml_large | 100 K | 589 MB | 16.4 s | 36 MB/s | 24.8 s | 24 MB/s |
| ml_images | 100 K | 9.7 GB | 165 s | 60 MB/s | 390 s | 25 MB/s |
Splice Investigation (ml_images)
| Mode | Time | Speed | Notes |
|---|---|---|---|
| Plaintext + read/send | 146 s | 68 MB/s | Fastest (no encryption) |
| Plaintext + splice | 157 s | 63 MB/s | +8 % overhead |
| kTLS + read/send | 165 s | 60 MB/s | +13 % (encryption cost) |
| kTLS + splice | 428 s | 23 MB/s | 2.9× slower (broken) |
Benchmarks run January 2026. Your mileage may vary depending on network conditions and hardware.
Tags: #linux #ktls #tls #rsync #performance #networking #encryption