Redis + AOF + Distributed Storage: A Cautionary Benchmark

Published: March 7, 2026 at 06:13 PM EST
Source: Dev.to

Overview

We put AOF persistence through nine configurations across local SSD SAS and Longhorn storage. The results are definitive.

When designing a caching layer for a production migration to bare‑metal Kubernetes, we faced a question that sounds simple but turned out to have an expensive answer: should Redis AOF persistence live on Longhorn distributed storage?

The Redis documentation hints at the answer, but intuition and documentation are not the same as production data. So we ran redis‑benchmark across nine configurations—varying storage backend, persistence settings, and dataset size—and measured the impact empirically.

The results are unambiguous, and one number in particular should give any architect pause.

Test Configuration

All tests used the same parameters throughout:

requests:    50,000
clients:     20 parallel
payload:     180,000 bytes (~180 KB)
keep-alive:  1 (connections reused)
thread:      single‑threaded

The 180 KB payload is intentional—it reflects realistic cache object sizes for the production workload being benchmarked, not the micro‑payload tests commonly seen in vendor benchmarks.
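For readers who want to repeat a run, the parameters above can be assembled into a redis-benchmark command line. The sketch below only builds the argument list. The host, port, and the `-t ping,set,get` test selection are assumptions for illustration and are not taken from the original runs.

```python
import shlex


def build_benchmark_cmd(host="127.0.0.1", port=6379, requests=50_000,
                        clients=20, payload_bytes=180_000,
                        tests=("ping", "set", "get")):
    """Return a redis-benchmark argument list for one run.

    host, port and tests are placeholder assumptions; the remaining
    defaults mirror the test configuration listed above.
    """
    return [
        "redis-benchmark",
        "-h", host,
        "-p", str(port),
        "-n", str(requests),       # total number of requests
        "-c", str(clients),        # parallel client connections
        "-d", str(payload_bytes),  # SET/GET value size in bytes
        "-k", "1",                 # keep connections alive
        "-t", ",".join(tests),     # run only the listed tests
    ]


if __name__ == "__main__":
    # Print a shell-ready command instead of executing it.
    print(shlex.join(build_benchmark_cmd()))
```

Passing the list to `subprocess.run` would execute the benchmark. Printing it first makes it easy to confirm that every environment receives identical flags.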

Nine Environments Tested

| Label | Storage | AOF | RDB | Dataset |
|---|---|---|---|---|
| Local · AOF off | Local SSD SAS | No | Thresholds | Empty |
| Local · AOF on (baseline) | Local SSD SAS | Yes | Thresholds | Empty |
| Local · AOF on (tuning 1) | Local SSD SAS | Yes | Thresholds | Empty |
| Local · AOF on (tuning 2) | Local SSD SAS | Yes | Thresholds | Empty |
| Local · AOF on (t2 + data) | Local SSD SAS | Yes | Thresholds | 375,795 keys |
| Longhorn · AOF on (empty) | Longhorn | Yes | Thresholds | Empty |
| Longhorn · AOF on (data) | Longhorn | Yes | Thresholds | 375,795 keys |

SET Throughput: The Core Finding

The most important metric for a write‑capable cache is SET throughput under load.

| Configuration | SET RPS | SET avg latency | SET p99 latency |
|---|---|---|---|
| Local · AOF off | 7,696 | 1.47 ms | 5.12 ms |
| Local · AOF on (baseline) | 1,275 | 14.39 ms | 102.53 ms |
| Local · AOF on (tuning 1) | 1,251 | 15.03 ms | 105.92 ms |
| Local · AOF on (tuning 2) | 1,248 | 15.03 ms | 112.38 ms |
| Local · AOF on (t2 + 375K keys) | 1,212 | 15.85 ms | 121.15 ms |
| Longhorn · AOF on (empty) | 577 | 33.56 ms | 225.66 ms |
| Longhorn · AOF on (375K keys) | 537 | 36.17 ms | 201.86 ms |

Takeaway:

  • Local SSD SAS with AOF disabled: 7,696 SET RPS, p99 ≈ 5 ms.
  • Longhorn with AOF enabled: 537 SET RPS, p99 ≈ 202 ms.

That is a 14.3× throughput difference and a 39× p99 latency difference—on the same application code, same Redis version, same client. The worst‑case single SET on Longhorn reached 903 ms.
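The two headline ratios follow directly from the table. A short check, using only the figures reported above:

```python
# SET results copied from the table: (requests per second, p99 latency in ms)
local_aof_off = (7696, 5.12)
longhorn_aof_on = (537, 201.86)

# Throughput ratio: how many times more SETs per second local/AOF-off sustains.
throughput_ratio = local_aof_off[0] / longhorn_aof_on[0]

# Latency ratio: how many times slower the Longhorn p99 is.
p99_ratio = longhorn_aof_on[1] / local_aof_off[1]

print(f"throughput: {throughput_ratio:.1f}x, p99: {p99_ratio:.1f}x")
# prints "throughput: 14.3x, p99: 39.4x"
```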

The AOF Wall on Local Storage

Even on fast local SSD SAS, AOF incurs a heavy penalty.

| Metric | AOF off (RDB only) | AOF on |
|---|---|---|
| SET p99 latency | 3.8–5.1 ms | 102–121 ms |
| SET average latency | ~1.5 ms | 14–16 ms |

Enabling AOF on local SSD SAS alone costs roughly 20× in p99 latency (5.12 ms to 102.53 ms), and tuning provides no relief:

Baseline:  102.5 ms p99
Tuning 1:  105.9 ms p99
Tuning 2:  112.4 ms p99  ← actually got worse

Why?
AOF with appendfsync everysec issues an fsync() roughly once per second. On a busy single‑threaded Redis instance processing 180 KB payloads, a large amount of data accumulates between syncs, and the resulting fsync stall dominates the latency budget. You cannot tune your way past it.
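To get a feel for what a synchronous append costs on a given volume, a generic probe like the one below can help. It is a minimal sketch and not Redis's AOF implementation: it appends fixed-size payloads to a file and times each write plus its fsync. The function name `time_appends` and its parameters are hypothetical, and the file should be placed on the volume you want to measure.

```python
import os
import tempfile
import time


def time_appends(path, payload_bytes=180_000, count=20, fsync_every=1):
    """Append `count` payloads to `path`; return per-append latency in ms.

    fsync_every=1 syncs after every append. A larger value batches
    several appends per fsync, so only every Nth append pays for it.
    """
    payload = b"x" * payload_bytes
    latencies_ms = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for i in range(1, count + 1):
            start = time.perf_counter()
            os.write(fd, payload)
            if i % fsync_every == 0:
                os.fsync(fd)  # block until the data reaches stable storage
            latencies_ms.append((time.perf_counter() - start) * 1000)
    finally:
        os.close(fd)
    return latencies_ms


if __name__ == "__main__":
    # A temporary directory is used here; point it at the volume under test.
    with tempfile.TemporaryDirectory() as tmp:
        samples = time_appends(os.path.join(tmp, "append.log"))
        print(f"max append latency: {max(samples):.2f} ms")
```

Running the same probe against a local disk and a network-replicated volume gives a rough, Redis-independent view of how much of the gap comes from the storage layer itself.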

Why Longhorn Makes AOF Catastrophic

Longhorn is a distributed block‑storage system for Kubernetes that replicates data across nodes for durability. This works well for workloads with controlled write patterns, but Redis AOF writes are continuous and latency‑sensitive:

  • Each AOF append traverses the network to the Longhorn controller.
  • The controller replicates the write to N replicas before acknowledging.
  • Only after the acknowledgment does Redis receive its fsync confirmation.
  • Redis is single‑threaded, so it blocks waiting for that round‑trip.

Result: every SET pays the cost of a network round‑trip plus multi‑replica write confirmation. At 180 KB payloads, the penalty explodes.
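One way to see why payload size matters is to convert the measured SET rates into the payload volume they imply. The sketch below multiplies each reported SET RPS by the 180,000-byte payload. These are derived figures, not measurements, and they ignore protocol framing and any AOF rewrite traffic.

```python
PAYLOAD_BYTES = 180_000  # value size used in every test

# SET requests per second, copied from the SET throughput table.
set_rps = {
    "Local, AOF off": 7696,
    "Local, AOF on (baseline)": 1275,
    "Longhorn, AOF on (375K keys)": 537,
}


def payload_mb_per_s(rps, payload_bytes=PAYLOAD_BYTES):
    """Payload volume in MB/s (1 MB = 1,000,000 bytes) implied by a SET rate."""
    return rps * payload_bytes / 1_000_000


for label, rps in set_rps.items():
    print(f"{label}: {payload_mb_per_s(rps):.1f} MB/s")
```

With AOF on, every SET payload is also appended to the log, so the AOF-on rows approximate the sustained write volume the storage has to absorb: about 230 MB/s locally and under 100 MB/s on Longhorn.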

Redis documentation warns:

“Avoid storing AOF/RDB files on storage that has network latency in the I/O path, such as NFS mounts.”

Longhorn is effectively that—a network‑replicated volume. Our benchmark quantifies the warning: 903 ms max latency, 202 ms p99.

GET Performance

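Read performance is far less sensitive to the storage backend because GETs do not write to the AOF log. GET throughput still drops about 3× once AOF is enabled (8,027 to 2,537 RPS), but moving the AOF to Longhorn makes almost no further difference.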

| Configuration | GET RPS | GET avg latency |
|---|---|---|
| Local · AOF off | 8,027 | 1.47 ms |
| Local · AOF on (baseline) | 2,537 | 4.29 ms |
| Longhorn · AOF on (375K keys) | 2,522 | 4.21 ms |

Longhorn does not significantly degrade GET performance compared to AOF‑on local storage, confirming that the bottleneck is the persistence write path.

PING Latency: The Baseline

PING throughput shows the overhead when no persistence is involved.

| Configuration | PING RPS | PING avg latency |
|---|---|---|
| Local · AOF off | ~37,000 | 0.32 ms |
| Local · AOF on (baseline) | 11,000–18,000 | 0.84–1.50 ms |
| Longhorn · AOF on | 19,000–21,000 | 0.74–0.83 ms |

Interestingly, PING performance on Longhorn is better than AOF‑on‑local at baseline. The Longhorn penalty only materializes when Redis actually needs to write to the AOF log—confirming that the bottleneck is specifically the persistence write path, not general Longhorn I/O overhead.

Recommendations

Based on these results, the right architecture for a write‑heavy cache that requires durability is:

  1. Run Redis on local, high‑performance SSD (or NVMe) storage.
  2. Disable AOF for the cache tier, relying on periodic RDB snapshots if durability is needed.
  3. If AOF is required, keep the AOF files on a local disk, not on network‑replicated volumes such as Longhorn, NFS, or similar.
  4. Separate concerns:
    • Use Redis (or another in‑memory store) for the hot cache layer.
    • Use a traditional database or a durable key‑value store for the persistent layer.
  5. Monitor fsync latency with the Redis latency monitor (latency-monitor-threshold, LATENCY LATEST) to ensure the appendfsync policy does not become a hidden bottleneck.
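The monitoring step can be scripted. The sketch below assumes a redis-py style client object exposing `config_set` and `execute_command`. The helper names and the 100 ms threshold are hypothetical, and the parsing assumes each LATENCY LATEST row starts with event name, timestamp, latest latency, and maximum latency.

```python
def enable_latency_monitor(client, threshold_ms=100):
    """Record latency events slower than threshold_ms (0 disables monitoring)."""
    client.config_set("latency-monitor-threshold", threshold_ms)


def latest_latency_events(client):
    """Return LATENCY LATEST rows as a dict keyed by event name."""
    rows = client.execute_command("LATENCY", "LATEST")
    events = {}
    for row in rows:
        # Assumed row layout: name, timestamp, latest ms, max ms.
        name, timestamp, latest_ms, max_ms = row[:4]
        if isinstance(name, bytes):
            name = name.decode()
        events[name] = {
            "timestamp": timestamp,
            "latest_ms": latest_ms,
            "max_ms": max_ms,
        }
    return events


def aof_latency_events(client):
    """Keep only events whose name starts with 'aof-'."""
    return {name: data
            for name, data in latest_latency_events(client).items()
            if name.startswith("aof-")}
```

With redis-py installed, `client = redis.Redis(host="localhost", port=6379)` would be a typical way to obtain the client. Calling `aof_latency_events(client)` after a benchmark run then shows whether AOF-related spikes crossed the threshold.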

Split‑Persistence Design

Hot Path (Primary)

  • Redis with AOF disabled
  • RDB snapshots only, using generous thresholds (e.g., save 3600 1)
  • Local‑path storage on SSD SAS
  • Result: 7,600+ SET RPS, p99 latency of about 5 ms

Recovery Path (Replica)

  • Redis replica of the primary
  • RDB‑only snapshots to persistent storage (Longhorn is acceptable here – snapshot writes are infrequent and bursty)
  • Not in the hot write path

This configuration delivers roughly 5 ms p99 latency at full write throughput while preserving durability via the replica’s periodic snapshots. If the primary fails, you lose at most one RDB snapshot interval of data, which is acceptable for most cache workloads.
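As a rough illustration of the split, the sketch below renders the redis.conf directives each role would need. The helper name `render_redis_conf`, the primary's hostname, the data directory, and the replica's snapshot rule are assumptions. Only `appendonly no` and the example `save 3600 1` rule come from the design above.

```python
def render_redis_conf(role, primary_host=None, primary_port=6379,
                      data_dir="/data", save_rule="3600 1"):
    """Render redis.conf directives for the 'primary' or 'replica' role.

    data_dir is where RDB snapshots land: a local-path volume for the
    primary, a persistent volume (for example Longhorn) for the replica.
    """
    if role not in ("primary", "replica"):
        raise ValueError(f"unknown role: {role!r}")
    lines = [
        "appendonly no",      # AOF stays off on both nodes
        f"save {save_rule}",  # RDB snapshot rule: <seconds> <changed keys>
        f"dir {data_dir}",    # directory that receives the RDB file
    ]
    if role == "replica":
        if primary_host is None:
            raise ValueError("a replica needs primary_host")
        # Follow the primary; the replica stays off the hot write path.
        lines.append(f"replicaof {primary_host} {primary_port}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_redis_conf("primary"))
    # "redis-primary" is a placeholder service name.
    print(render_redis_conf("replica", primary_host="redis-primary"))
```

Generating both files from one function keeps the two roles identical except for the `replicaof` line and the volume behind `dir`, which is the whole point of the design.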

If true durability for every write is required (rare for a cache), the correct solution is a different tool – not Redis with AOF on distributed storage.

Summary

| Question | Answer |
|---|---|
| Can AOF on local SSD SAS achieve good SET latency? | No. p99 stays above 100 ms regardless of tuning. |
| Can AOF on Longhorn achieve acceptable SET latency? | No. p99 reaches 202 ms, max 903 ms. |
| Does Longhorn affect GET performance with AOF? | Minimally – GETs don’t write to AOF. |
| What’s the right architecture for high‑throughput caching? | AOF disabled on hot path, RDB replica for recovery. |
| Is the Redis documentation warning about network storage accurate? | Definitively yes. Our data confirms it. |

The 14× throughput gap between AOF‑on‑Longhorn and AOF‑off‑local is not a configuration problem; it’s an architectural mismatch. Building a fast cache on slow persistence is a contradiction – and these numbers prove it.

Environment Details

  • Redis version: 7.x
  • Storage backends:
    • Local‑path provisioner (SSD SAS)
    • Longhorn 1.6 on Kubernetes 1.31
    • redis‑benchmark parameters: -n 50000 -c 20 -d 180000 -k 1
  • Mode: Single‑threaded (no --threads flag)
  • Dataset: Empty at baseline; 375,795 keys for loaded tests

Questions about Redis architecture on Kubernetes? Leave a comment below.

— Iwan Setiawan, Hybrid Cloud & Platform Architect · portfolio.kangservice.cloud
