[Paper] FlexKV: Flexible Index Offloading for Memory-Disaggregated Key-Value Store

Published: December 17, 2025 at 11:03 PM EST

Source: arXiv - 2512.16148v1

Overview

The paper introduces FlexKV, a new design for key‑value (KV) stores that run on memory‑disaggregated data‑center architectures. By moving index processing from the remote memory pool onto the compute nodes, FlexKV dramatically speeds up lookups while keeping the benefits of disaggregated memory: higher utilization and easier scaling.

Key Contributions

  • Index Proxying: Dynamically offloads KV index handling to compute nodes, exploiting their powerful CPUs instead of relying on slow one‑sided atomic ops in remote memory.
  • Rank‑Aware Hotness Detection: A lightweight algorithm that continuously monitors key “hotness” and redistributes index shards to keep load balanced across compute nodes.
  • Two‑Level Compute‑Node Memory Optimization: Combines a fast on‑chip cache with a managed off‑chip buffer, allowing the index to fit within limited compute‑node memory without sacrificing performance.
  • RPC‑Aggregated Cache Management: Batches remote cache‑coherence messages into fewer RPCs, cutting the network traffic and latency caused by frequent coherence updates (a minimal sketch follows this list).
  • Performance Gains: Demonstrates up to 2.94× higher throughput and 85.2 % lower latency versus the best existing memory‑disaggregated KV stores.
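
The paper does not include code, so here is a minimal, hedged sketch of the RPC‑aggregation idea: coherence messages are queued locally and shipped in a single RPC once either a batch‑size or a time threshold is reached. All names (`Invalidation`, `Batcher`, the thresholds, the `send` callback) are hypothetical stand‑ins, not FlexKV's actual interfaces.

```go
// Hedged sketch of RPC-aggregated cache coherence (illustrative, not FlexKV's API).
package main

import (
	"fmt"
	"sync"
	"time"
)

// Invalidation identifies an index entry whose cached copy must be refreshed.
// The fields are illustrative; the paper does not specify a wire format.
type Invalidation struct {
	Partition int
	Key       string
}

// Batcher accumulates coherence messages and sends them as one aggregated RPC
// once either the batch-size or the time threshold is reached.
type Batcher struct {
	mu       sync.Mutex
	pending  []Invalidation
	maxBatch int
	maxWait  time.Duration
	send     func([]Invalidation) // stand-in for the real RPC call
}

func NewBatcher(maxBatch int, maxWait time.Duration, send func([]Invalidation)) *Batcher {
	b := &Batcher{maxBatch: maxBatch, maxWait: maxWait, send: send}
	go b.flushLoop()
	return b
}

// Add queues one message; a full batch is flushed immediately.
func (b *Batcher) Add(inv Invalidation) {
	b.mu.Lock()
	b.pending = append(b.pending, inv)
	full := len(b.pending) >= b.maxBatch
	b.mu.Unlock()
	if full {
		b.Flush()
	}
}

// Flush sends whatever is queued as a single aggregated RPC.
func (b *Batcher) Flush() {
	b.mu.Lock()
	batch := b.pending
	b.pending = nil
	b.mu.Unlock()
	if len(batch) > 0 {
		b.send(batch)
	}
}

// flushLoop bounds the extra latency introduced by batching.
func (b *Batcher) flushLoop() {
	for range time.Tick(b.maxWait) {
		b.Flush()
	}
}

func main() {
	b := NewBatcher(4, 5*time.Millisecond, func(batch []Invalidation) {
		fmt.Printf("one RPC carrying %d coherence message(s)\n", len(batch))
	})
	for i := 0; i < 10; i++ {
		b.Add(Invalidation{Partition: i % 3, Key: fmt.Sprintf("k%d", i)})
	}
	time.Sleep(20 * time.Millisecond) // let the timer flush the tail
}
```

The design trade‑off is a small, bounded delay from the timer in exchange for far fewer network round‑trips, which is consistent with the roughly 60% reduction in coherence traffic reported below.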

Methodology

FlexKV treats the KV index as a proxy that can migrate between the remote memory pool and the compute nodes:

  1. Hotness Ranking: Each compute node tracks access frequencies for index partitions. A global rank is computed, and hot partitions are proactively moved to nodes that are under‑utilized.
  2. Memory Tiering on Compute Nodes (illustrated in the sketch after this list):
    • Level‑1: A small, fast cache (e.g., L3 or a dedicated DRAM slice) holds the most frequently accessed index entries.
    • Level‑2: A larger, slower buffer (still on the compute node) stores the rest of the offloaded index, spilling to remote memory only when needed.
  3. RPC‑Aggregated Cache Coherence: Instead of sending a coherence message for every single KV operation, FlexKV aggregates them into batched RPCs, dramatically reducing the number of network round‑trips.
  4. Evaluation: The authors prototype FlexKV on a cluster equipped with disaggregated memory (e.g., Intel Optane‑DC or similar) and compare it against leading systems like Memcached‑DM and Remote‑Atomic‑KV. Benchmarks include YCSB workloads with varying read/write mixes and key‑size distributions.
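
As a concrete illustration of steps 1 and 2, the sketch below shows a compute‑node lookup path: every access bumps a per‑partition hotness counter (the signal that would feed the rank‑aware redistribution), and the lookup then walks the small level‑1 cache, the larger level‑2 buffer, and finally remote memory. Everything here, including the `TieredIndex` type, the eviction policy, and the `remoteFetch` stub, is an assumption for illustration rather than the paper's implementation.

```go
// Hedged sketch: per-partition hotness counting plus a two-level lookup path.
package main

import "fmt"

// remoteFetch stands in for a one-sided read of an index entry from the
// remote memory pool; the real system would issue RDMA or an RPC here.
func remoteFetch(key string) (string, bool) {
	return "value-of-" + key, true
}

// TieredIndex keeps a small hot cache (level 1) and a larger local buffer
// (level 2); anything missing locally is fetched from remote memory.
type TieredIndex struct {
	l1        map[string]string // small, fastest tier
	l2        map[string]string // larger compute-node buffer
	l1Cap     int
	hotness   map[int]uint64 // access counts per index partition
	partition func(key string) int
}

func NewTieredIndex(l1Cap int, partition func(string) int) *TieredIndex {
	return &TieredIndex{
		l1:        make(map[string]string),
		l2:        make(map[string]string),
		l1Cap:     l1Cap,
		hotness:   make(map[int]uint64),
		partition: partition,
	}
}

// Get records the access for hotness ranking, then walks L1 -> L2 -> remote.
func (t *TieredIndex) Get(key string) (string, bool) {
	t.hotness[t.partition(key)]++

	if v, ok := t.l1[key]; ok {
		return v, true
	}
	if v, ok := t.l2[key]; ok {
		t.promote(key, v) // repeatedly used entries migrate toward L1
		return v, true
	}
	v, ok := remoteFetch(key)
	if ok {
		t.l2[key] = v
	}
	return v, ok
}

// promote moves an entry into L1, evicting arbitrarily when the tier is full
// (a real policy would evict by recency or hotness rank).
func (t *TieredIndex) promote(key, value string) {
	if len(t.l1) >= t.l1Cap {
		for k := range t.l1 {
			delete(t.l1, k)
			break
		}
	}
	t.l1[key] = value
}

func main() {
	idx := NewTieredIndex(2, func(k string) int { return len(k) % 4 })
	for _, k := range []string{"user:1", "user:1", "user:2", "session:9"} {
		v, _ := idx.Get(k)
		fmt.Println(k, "->", v)
	}
	fmt.Println("per-partition hotness:", idx.hotness)
}
```

In the full system the per‑node hotness counts would be combined into a global rank and hot partitions migrated toward under‑utilized nodes; the sketch only shows where that signal is collected and how the two memory levels are consulted.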

Results & Findings

FlexKV vs. the best prior memory‑disaggregated KV stores:

  • Throughput (max, YCSB A): ↑ 2.94×
  • Average read latency: ↓ 85.2%
  • Load imbalance (std. dev.): reduced by ~70%
  • Cache‑coherence traffic: cut by ~60%, thanks to RPC aggregation
  • Memory footprint on compute nodes: ≤ 30% of the total index size, thanks to the two‑level scheme

The experiments show that offloading the index not only speeds up individual operations but also smooths out hot‑spot formation, leading to more predictable performance under mixed workloads.

Practical Implications

  • For Cloud Providers: FlexKV enables tighter packing of memory resources across tenants while still delivering low‑latency KV services, potentially lowering hardware costs.
  • For Developers of Distributed Databases: The rank‑aware hotness detection can be adopted as a plug‑in module to balance shard placement in any sharded store, not just KV.
  • Edge & Fog Computing: Small compute nodes with limited local memory can still host high‑performance KV indexes, making disaggregated memory viable for latency‑sensitive edge workloads.
  • Operational Simplicity: By reducing reliance on remote atomic primitives, FlexKV lessens the need for specialized NICs or RDMA‑only networks, allowing existing Ethernet‑based data‑center fabrics to be used.

Overall, FlexKV demonstrates a practical path to combine the scalability of memory disaggregation with the speed of local compute, a sweet spot many modern services (caching layers, session stores, feature‑flag databases) are eager to hit.

Limitations & Future Work

  • Compute‑Node Memory Bound: Although the two‑level scheme mitigates pressure, extremely large indexes may still exceed available compute‑node memory, forcing more frequent remote accesses.
  • Workload Sensitivity: The hotness detection algorithm assumes relatively stable access patterns; highly volatile workloads could cause frequent reshuffling, adding overhead.
  • Prototype Scope: Experiments were performed on a modest cluster; scaling to thousands of nodes and heterogeneous hardware (e.g., GPUs, ARM CPUs) remains untested.
  • Future Directions: The authors suggest exploring adaptive RPC batching based on network congestion, integrating machine‑learning predictors for hotness, and extending the proxy model to support secondary indexes and range queries.

Authors

  • Zhisheng Hu
  • Jiacheng Shen
  • Ming-Chang Yang

Paper Information

  • arXiv ID: 2512.16148v1
  • Categories: cs.DC
  • Published: December 18, 2025
