[Paper] SafarDB: FPGA-Accelerated Distributed Transactions via Replicated Data Types

Published: March 9, 2026 at 02:16 AM EDT
Source: arXiv

Overview

SafarDB is a new system that moves the heavy lifting of distributed transaction replication onto an FPGA that sits directly on the network fabric. By tightly coupling a custom network interface with a replication engine, the authors achieve dramatically lower latency and higher throughput for both conflict‑free (CRDT) and strongly‑consistent (WRDT) data types—key building blocks for modern, highly‑available services.

Key Contributions

  • Network‑attached FPGA replication engine – a purpose‑built NIC that runs replication logic on the FPGA, eliminating the CPU‑NIC round‑trip overhead.
  • Unified support for CRDTs and WRDTs – accelerates both relaxed (commutative) and strong‑ordering replication paths, including the consensus control path required for WRDTs.
  • Performance gains – up to 7× lower latency and 5.3× higher throughput for CRDTs, and 12× lower latency and 6.8× higher throughput for WRDTs versus the best RDMA‑based solutions.
  • Improved fault tolerance – faster leader failure detection and leader election, and higher resilience to crash failures compared with CPU/RDMA implementations.
  • Co‑design methodology – demonstrates how redesigning the NIC to match application semantics can unlock hardware acceleration benefits beyond traditional Smart‑NIC offloads.

Methodology

  1. Hardware‑software co‑design – The team built a custom FPGA card that hosts both a lightweight network stack and the replication engine. The network stack parses incoming packets, extracts transaction metadata, and forwards it directly to the replication logic without involving the host CPU.
  2. Replication primitives
    • CRDT path – Implements commutative operations (e.g., counters, sets) that can be applied in any order, allowing the FPGA to apply updates immediately.
    • WRDT path – For operations that need strong ordering, the FPGA runs a consensus protocol (a streamlined Paxos/Raft variant) to agree on a total order before applying the update.
  3. Operator offload – Frequently used data‑type operators (merge, apply, conflict resolution) are compiled into FPGA logic, enabling “near‑network execution” of the transaction.
  4. Evaluation setup – Experiments were run on a 10‑GbE testbed with a multi‑node cluster. Baselines included a state‑of‑the‑art RDMA‑based CRDT/WRDT library running on CPUs. Metrics captured latency, throughput, and failure‑recovery times under varying contention and failure scenarios.
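The CRDT path works because operations such as counter increments commute: replicas can apply and merge updates in any order and still converge, which is what lets the FPGA apply them immediately. A minimal software sketch of a grow-only counter (G-Counter) illustrating those semantics, not the authors' FPGA implementation:

```python
class GCounter:
    """Grow-only counter CRDT: one increment slot per replica."""

    def __init__(self, replica_id, n_replicas):
        self.replica_id = replica_id
        self.slots = [0] * n_replicas

    def increment(self, amount=1):
        # Each replica only ever writes its own slot.
        self.slots[self.replica_id] += amount

    def value(self):
        return sum(self.slots)

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so merges can arrive in any order without coordination.
        self.slots = [max(a, b) for a, b in zip(self.slots, other.slots)]


a = GCounter(0, 2)
b = GCounter(1, 2)
a.increment(3)
b.increment(4)
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # both replicas converge to 7
```

Because merge order never matters, there is no consensus round on this path; the hardware can apply each update as soon as it leaves the packet parser.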

Results & Findings

| Metric | SafarDB (FPGA) | RDMA‑CPU Baseline | Improvement |
| --- | --- | --- | --- |
| CRDT latency (median) | ~30 µs | ~210 µs | 7× |
| CRDT throughput | 1.2 M ops/s | 225 k ops/s | 5.3× |
| WRDT latency (median) | ~45 µs | ~540 µs | 12× |
| WRDT throughput | 800 k ops/s | 118 k ops/s | 6.8× |
| Leader failure detection | <150 µs | >1 ms | >6× faster |
| Crash‑failure resilience | No throughput drop under node loss | Significant slowdown | More robust |

The numbers show that moving replication logic onto the network‑attached FPGA not only cuts the round‑trip time but also frees the host CPU to handle application logic, leading to higher overall system scalability.
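The improvement factors in the table follow directly from the raw measurements; a quick arithmetic check:

```python
# (SafarDB value, RDMA-CPU baseline value) per metric, from the table above.
rows = {
    "CRDT latency (us)": (30, 210),
    "CRDT throughput (ops/s)": (1_200_000, 225_000),
    "WRDT latency (us)": (45, 540),
    "WRDT throughput (ops/s)": (800_000, 118_000),
}

for name, (safardb, baseline) in rows.items():
    # For latency, lower is better; for throughput, higher is better.
    # Either way the improvement factor is the larger over the smaller.
    ratio = max(safardb, baseline) / min(safardb, baseline)
    print(f"{name}: {ratio:.1f}x")  # 7.0x, 5.3x, 12.0x, 6.8x
```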

Practical Implications

  • Micro‑services & stateful edge services – Developers can offload replication of shared state (counters, leaderboards, configuration maps) to the FPGA, achieving sub‑100 µs consistency guarantees without sacrificing CPU cycles.
  • Database sharding & multi‑master setups – SafarDB’s WRDT support makes it feasible to run strongly consistent transactions across geographically distributed replicas with dramatically lower commit latency.
  • High‑frequency trading, IoT gateways, and gaming back‑ends – Scenarios that demand ultra‑low latency updates can benefit from near‑network execution of CRDT/WRDT operations.
  • Simplified infrastructure – By integrating the NIC and replication engine, operators can reduce the number of moving parts (no separate Smart‑NIC firmware, no RDMA tuning), easing deployment in modern data‑center fabrics that already use FPGA accelerators.
  • Cost‑performance trade‑off – While FPGA cards have a higher upfront cost than pure‑CPU servers, the throughput gains and reduced CPU load can lower total cost of ownership for workloads that are replication‑bound.
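The strong consistency behind the WRDT use cases amounts to every replica applying writes in one agreed total order. A toy sketch of leader-based sequencing with gap buffering (all names hypothetical; the paper's FPGA runs a streamlined Paxos/Raft variant, not this code):

```python
class Leader:
    """Assigns a global sequence number to each write."""

    def __init__(self):
        self.next_seq = 0

    def order(self, op):
        seq = self.next_seq
        self.next_seq += 1
        return seq, op


class Replica:
    """Applies operations strictly in sequence order, buffering any gaps."""

    def __init__(self):
        self.state = 0
        self.applied = 0      # next sequence number to apply
        self.pending = {}     # out-of-order arrivals wait here

    def deliver(self, seq, op):
        self.pending[seq] = op
        # Drain the buffer only while the next expected number is present.
        while self.applied in self.pending:
            self.state = self.pending.pop(self.applied)(self.state)
            self.applied += 1


leader = Leader()
r1, r2 = Replica(), Replica()
ops = [leader.order(lambda s: s + 5), leader.order(lambda s: s * 2)]

for seq, op in ops:            # r1 receives writes in order
    r1.deliver(seq, op)
for seq, op in reversed(ops):  # r2 receives them out of order
    r2.deliver(seq, op)

print(r1.state, r2.state)  # both replicas reach 10: (0 + 5) * 2
```

Non-commuting operations like these are exactly why the WRDT path needs a consensus round before applying, and why shrinking that round on the FPGA yields the large latency wins reported above.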

Limitations & Future Work

  • Hardware dependence – The performance boost hinges on having FPGA‑enabled NICs; environments without such hardware cannot reap the benefits.
  • Programming model – Developers must express their data‑type logic in a hardware‑friendly form (e.g., via high‑level synthesis, HLS), which adds a learning curve compared to pure‑software libraries.
  • Scalability beyond a single rack – The current prototype focuses on a single‑rack, 10‑GbE network; extending the design to multi‑rack, multi‑datacenter topologies may require additional protocol optimizations.
  • Consensus protocol flexibility – The built‑in consensus is tuned for WRDTs; supporting alternative protocols (e.g., Byzantine fault tolerance) would broaden applicability.

Future research directions include exposing higher‑level APIs for developers, integrating with popular distributed databases (e.g., CockroachDB, TiDB), and exploring hybrid CPU‑FPGA pipelines that dynamically shift workloads based on contention patterns.

Authors

  • Javad Saberlatibari
  • Prithviraj Yuvaraj
  • Mohsen Lesani
  • Philip Brisk
  • Mohammad Sadoghi

Paper Information

  • arXiv ID: 2603.08003v1
  • Categories: cs.DC
  • Published: March 9, 2026
