[Paper] SwitchDelta: Asynchronous Metadata Updating for Distributed Storage with In-Network Data Visibility
Source: arXiv - 2511.19978v1
Overview
The paper introduces SwitchDelta, a technique that pushes metadata updates into programmable network switches, allowing newly written data to become visible before the traditional metadata write completes. By decoupling metadata updates from the critical write path, SwitchDelta speeds up ordered writes while still guaranteeing strong consistency, an attractive proposition for anyone building high‑performance distributed storage services.
Key Contributions
- In‑network metadata buffering: Uses P4‑programmable switches to temporarily store metadata updates, making newly written data visible to clients without waiting for the metadata node.
- Best‑effort data‑plane design: Introduces lightweight mechanisms (e.g., compact encoding, selective eviction) that respect the limited memory and processing budget of switches.
- Metadata update protocol: A new protocol that reconciles the switch‑cached metadata with the persistent metadata store, ensuring eventual consistency and crash safety.
- Broad evaluation: Demonstrates the approach on three representative in‑memory storage systems (log‑structured KV store, distributed file system, secondary index) and shows up to 52 % latency reduction and 127 % throughput boost on write‑heavy workloads.
Methodology
- System Model – The authors assume a classic two‑tier architecture: data nodes store the actual payload, while a separate metadata service tracks object locations, versions, and visibility flags.
- Switch‑side Buffer – When a client issues a write, the data node stores the payload first. The accompanying metadata update (e.g., “object X is now at version V”) is encapsulated in a small packet and forwarded to a programmable switch, which stores it in a tiny hash‑based cache (the write/read/commit flow is modeled in the first sketch after this list).
- In‑network Visibility – While the metadata is still in the switch, any read request that traverses the same switch can be answered directly from the cached entry, effectively “seeing” the new data instantly.
- Commit & Reconciliation – The metadata node later receives the same update (via a reliable control channel). It writes the entry to durable storage and sends an acknowledgment. The switch then either discards the cached copy or marks it as committed. If the switch crashes or the entry expires, the system falls back to the traditional path, preserving strong consistency.
- Resource Management – Because switches have only a few megabytes of SRAM, the design employs the following (the encoding is illustrated in the second sketch after this list):
  - Compact encoding (bit‑fields for version, object ID, etc.)
  - Eviction policies that prioritize recent writes and drop stale entries
  - Fallback handling for cache misses, which reverts to the traditional metadata path so correctness is preserved without falling below baseline performance
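The write, read, and commit paths above can be summarized with a small host-side model. The sketch below is only an illustration of the protocol as described, not the paper's P4 implementation: the names (`SwitchMetadataCache`, `MetaUpdate`, `buffer_update`, `on_commit_ack`) and the eviction policy are assumptions, and the switch's hash-based SRAM cache is modeled as a bounded Python dict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetaUpdate:
    """One buffered metadata update: 'object obj_id is now at version version on data_node'."""
    obj_id: int
    version: int
    data_node: str
    committed: bool = False  # set once the metadata service has persisted the update


class SwitchMetadataCache:
    """Host-side model of the switch's small hash-based metadata buffer."""

    def __init__(self, capacity: int = 4096):
        self.capacity = capacity
        self.entries: dict[int, MetaUpdate] = {}

    def buffer_update(self, update: MetaUpdate) -> bool:
        """Write path: buffer the update so the new data becomes visible immediately.
        Returns False when the cache cannot take the entry; the writer then uses the
        traditional ordered path (write metadata first, then expose the data)."""
        if update.obj_id not in self.entries:
            if len(self.entries) >= self.capacity:
                self._evict_one()
            if len(self.entries) >= self.capacity:
                return False  # still full of uncommitted entries: fall back
        current = self.entries.get(update.obj_id)
        if current is None or update.version > current.version:
            self.entries[update.obj_id] = update
        return True

    def lookup(self, obj_id: int) -> Optional[MetaUpdate]:
        """Read path: a read traversing the switch is answered from the cached entry.
        A miss means the client consults the metadata service as usual."""
        return self.entries.get(obj_id)

    def on_commit_ack(self, obj_id: int, version: int) -> None:
        """Reconciliation: the metadata service has durably written the update and
        acknowledged it; the cached copy can now be marked committed (or discarded)."""
        entry = self.entries.get(obj_id)
        if entry is not None and entry.version <= version:
            entry.committed = True

    def _evict_one(self) -> None:
        """Eviction: reclaim a committed (now redundant) entry first; uncommitted entries
        are kept so that visibility is never lost before the metadata write completes."""
        victim = next((k for k, e in self.entries.items() if e.committed), None)
        if victim is not None:
            del self.entries[victim]
```

In the real system this state lives in the switch data plane and is driven by match-action logic; the model only mirrors the protocol-level behavior: buffered updates make data visible to reads immediately, commit acknowledgments from the metadata service let entries be reclaimed, and a full cache or a miss simply sends the request down the traditional path.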
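The compact encoding mentioned above can be pictured as a fixed-width bit-field layout. The field widths below (32-bit object ID, 24-bit version, 8-bit node ID) are assumptions chosen only to show the idea; the paper's actual layout is not reproduced here.

```python
import struct

# Assumed field widths for illustration only (not the paper's exact layout).
OBJ_ID_BITS, VERSION_BITS, NODE_BITS = 32, 24, 8  # 64 bits total


def encode_update(obj_id: int, version: int, node_id: int) -> bytes:
    """Pack a metadata update into one 64-bit word so it fits a fixed-size
    packet field and a single switch register entry."""
    assert obj_id < (1 << OBJ_ID_BITS)
    assert version < (1 << VERSION_BITS)
    assert node_id < (1 << NODE_BITS)
    word = (obj_id << (VERSION_BITS + NODE_BITS)) | (version << NODE_BITS) | node_id
    return struct.pack("!Q", word)  # network byte order


def decode_update(payload: bytes) -> tuple[int, int, int]:
    """Inverse of encode_update: recover (obj_id, version, node_id)."""
    (word,) = struct.unpack("!Q", payload)
    node_id = word & ((1 << NODE_BITS) - 1)
    version = (word >> NODE_BITS) & ((1 << VERSION_BITS) - 1)
    obj_id = word >> (VERSION_BITS + NODE_BITS)
    return obj_id, version, node_id
```

On a real P4 target this layout would live in header and register definitions rather than in Python, but the bit-packing arithmetic is the same.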
Results & Findings
| Workload | Latency Reduction | Throughput Gain |
|---|---|---|
| Write‑heavy KV store | ≈ 52 % (99th percentile) | ≈ 127 % |
| Distributed file system (small files) | 38 % | 94 % |
| Secondary index (range scans) | 31 % | 68 % |
Key observations
- Workload sensitivity: Benefits grow with the proportion of writes; read‑only workloads see negligible impact, as expected.
- Switch load: Even with modest SRAM (≈ 2 MiB), the switch can buffer thousands of metadata updates without saturating its pipeline.
- Failure resilience: In simulated switch failures, the system gracefully reverts to the classic ordered‑write path with no loss of consistency.
Practical Implications
- Faster write‑heavy services: Cloud databases, log‑structured caches, and object stores can cut a substantial share of per‑write latency, directly translating to lower tail latency for user‑facing APIs.
- Cost‑effective scaling: Instead of provisioning more powerful metadata servers, operators can invest in inexpensive programmable switches (e.g., Tofino) to achieve similar performance gains.
- Simplified client logic: Clients continue to use the standard read/write APIs; the visibility boost is transparent, requiring only a small library to encode metadata packets (see the client‑side sketch after this list).
- Potential for hybrid cloud: Edge or ISP switches could host the metadata buffer, bringing write visibility closer to the client and reducing cross‑region round‑trips.
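As a rough picture of what “transparent to clients” means in the item above, the sketch below keeps a conventional put/get interface and adds only a thin shim that builds the metadata packet. Every name here (`StorageClient`, `data_conn`, `switch_conn`, `meta_conn`, `send_async`) is hypothetical and not taken from the paper; it reuses `encode_update` from the earlier encoding sketch.

```python
class StorageClient:
    """Hypothetical client shim: the put/get surface is unchanged; only the
    metadata-packet construction (encode_update from the earlier sketch) is new."""

    def __init__(self, data_conn, switch_conn, meta_conn):
        self.data_conn = data_conn      # connection to the data node
        self.switch_conn = switch_conn  # path that traverses the programmable switch
        self.meta_conn = meta_conn      # reliable channel to the metadata service

    def put(self, obj_id: int, version: int, node_id: int, payload: bytes) -> None:
        self.data_conn.write(obj_id, payload)                           # 1. store the payload
        self.switch_conn.send(encode_update(obj_id, version, node_id))  # 2. data becomes visible
        self.meta_conn.send_async(obj_id, version, node_id)             # 3. durable metadata, off the critical path

    def get(self, obj_id: int) -> bytes:
        # Answered from the switch's cached entry when present; otherwise the
        # request falls back to the metadata service and the ordered path.
        return self.switch_conn.read(obj_id)
```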
Limitations & Future Work
- Switch resource constraints: The approach relies on a small amount of SRAM; extremely high write rates could cause cache thrashing, limiting scalability.
- Protocol complexity: Adding a control channel between metadata nodes and switches introduces extra engineering effort and debugging surface.
- Security & isolation: Exposing metadata to the data plane raises questions about access control and multi‑tenant isolation, which the paper only touches on.
- Future directions: The authors suggest exploring adaptive cache sizing, integrating with P4Runtime APIs for dynamic reconfiguration, and extending the model to persistent (SSD‑based) storage, where write latency is higher.
Authors
- Junru Li
- Qing Wang
- Zhe Yang
- Shuo Liu
- Jiwu Shu
- Youyou Lu
Paper Information
- arXiv ID: 2511.19978v1
- Categories: cs.DC, cs.DB
- Published: November 25, 2025
- PDF: https://arxiv.org/pdf/2511.19978v1