[Paper] SafarDB: FPGA-Accelerated Distributed Transactions via Replicated Data Types

Published: March 9, 2026 at 02:16 AM EDT
Source: arXiv

Overview

SafarDB is a new system that moves the heavy lifting of distributed transaction replication onto an FPGA that sits directly on the network fabric. By tightly coupling a custom network interface with a replication engine, the authors achieve dramatically lower latency and higher throughput for both conflict‑free (CRDT) and strongly‑consistent (WRDT) data types—key building blocks for modern, highly‑available services.

Key Contributions

  • Network‑attached FPGA replication engine – a purpose‑built NIC that runs replication logic on the FPGA, eliminating the CPU‑NIC round‑trip overhead.
  • Unified support for CRDTs and WRDTs – accelerates both relaxed (commutative) and strong‑ordering replication paths, including the consensus control path required for WRDTs.
  • Performance gains – up to 7× lower latency and 5.3× higher throughput for CRDTs, and 12× lower latency and 6.8× higher throughput for WRDTs versus the best RDMA‑based solutions.
  • Improved fault tolerance – faster leader failure detection and leader election, and higher resilience to crash failures compared with CPU/RDMA implementations.
  • Co‑design methodology – demonstrates how redesigning the NIC to match application semantics can unlock hardware acceleration benefits beyond traditional Smart‑NIC offloads.

Methodology

  1. Hardware‑software co‑design – The team built a custom FPGA card that hosts both a lightweight network stack and the replication engine. The network stack parses incoming packets, extracts transaction metadata, and forwards it directly to the replication logic without involving the host CPU.
  2. Replication primitives
    • CRDT path – Implements commutative operations (e.g., counters, sets) that can be applied in any order, allowing the FPGA to apply updates immediately.
    • WRDT path – For operations that need strong ordering, the FPGA runs a consensus protocol (a streamlined Paxos/Raft variant) to agree on a total order before applying the update.
  3. Operator offload – Frequently used data‑type operators (merge, apply, conflict resolution) are compiled into FPGA logic, enabling “near‑network execution” of the transaction.
  4. Evaluation setup – Experiments were run on a 10‑GbE testbed with a multi‑node cluster. Baselines included a state‑of‑the‑art RDMA‑based CRDT/WRDT library running on CPUs. Metrics captured latency, throughput, and failure‑recovery times under varying contention and failure scenarios.
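The CRDT path works because operations such as counter increments commute: replicas can apply and merge updates in any order and still converge, which is what lets the FPGA apply them immediately. A minimal software sketch of a grow-only counter (G-Counter) illustrating those semantics, not the authors' FPGA implementation:

```python
class GCounter:
    """Grow-only counter CRDT: one increment slot per replica."""

    def __init__(self, replica_id, n_replicas):
        self.replica_id = replica_id
        self.slots = [0] * n_replicas

    def increment(self, amount=1):
        # Each replica only ever writes its own slot.
        self.slots[self.replica_id] += amount

    def value(self):
        return sum(self.slots)

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so merges can arrive in any order without coordination.
        self.slots = [max(a, b) for a, b in zip(self.slots, other.slots)]


a = GCounter(0, 2)
b = GCounter(1, 2)
a.increment(3)
b.increment(4)
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # both replicas converge to 7
```

Because merge order never matters, there is no consensus round on this path; the hardware can apply each update as soon as it leaves the packet parser.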

Results & Findings

| Metric | SafarDB (FPGA) | RDMA‑CPU Baseline | Improvement |
| --- | --- | --- | --- |
| CRDT latency (median) | ~30 µs | ~210 µs | 7× |
| CRDT throughput | 1.2 M ops/s | 225 k ops/s | 5.3× |
| WRDT latency (median) | ~45 µs | ~540 µs | 12× |
| WRDT throughput | 800 k ops/s | 118 k ops/s | 6.8× |
| Leader failure detection | <150 µs | >1 ms | >6× faster |
| Crash‑failure resilience | No throughput drop under node loss | Significant slowdown | More robust |

The numbers show that moving replication logic onto the network‑attached FPGA not only cuts the round‑trip time but also frees the host CPU to handle application logic, leading to higher overall system scalability.
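The improvement factors in the table follow directly from the raw measurements; a quick arithmetic check:

```python
# (SafarDB value, RDMA-CPU baseline value) per metric, from the table above.
rows = {
    "CRDT latency (us)": (30, 210),
    "CRDT throughput (ops/s)": (1_200_000, 225_000),
    "WRDT latency (us)": (45, 540),
    "WRDT throughput (ops/s)": (800_000, 118_000),
}

for name, (safardb, baseline) in rows.items():
    # For latency, lower is better; for throughput, higher is better.
    # Either way the improvement factor is the larger over the smaller.
    ratio = max(safardb, baseline) / min(safardb, baseline)
    print(f"{name}: {ratio:.1f}x")  # 7.0x, 5.3x, 12.0x, 6.8x
```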

Practical Implications

  • Micro‑services & stateful edge services – Developers can offload replication of shared state (counters, leaderboards, configuration maps) to the FPGA, achieving sub‑100 µs consistency guarantees without sacrificing CPU cycles.
  • Database sharding & multi‑master setups – SafarDB’s WRDT support makes it feasible to run strongly consistent transactions across geographically distributed replicas with dramatically lower commit latency.
  • High‑frequency trading, IoT gateways, and gaming back‑ends – Scenarios that demand ultra‑low latency updates can benefit from near‑network execution of CRDT/WRDT operations.
  • Simplified infrastructure – By integrating the NIC and replication engine, operators can reduce the number of moving parts (no separate Smart‑NIC firmware, no RDMA tuning), easing deployment in modern data‑center fabrics that already use FPGA accelerators.
  • Cost‑performance trade‑off – While FPGA cards have a higher upfront cost than pure‑CPU servers, the throughput gains and reduced CPU load can lower total cost of ownership for workloads that are replication‑bound.
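The strong consistency behind the WRDT use cases amounts to every replica applying writes in one agreed total order. A toy sketch of leader-based sequencing with gap buffering (all names hypothetical; the paper's FPGA runs a streamlined Paxos/Raft variant, not this code):

```python
class Leader:
    """Assigns a global sequence number to each write."""

    def __init__(self):
        self.next_seq = 0

    def order(self, op):
        seq = self.next_seq
        self.next_seq += 1
        return seq, op


class Replica:
    """Applies operations strictly in sequence order, buffering any gaps."""

    def __init__(self):
        self.state = 0
        self.applied = 0      # next sequence number to apply
        self.pending = {}     # out-of-order arrivals wait here

    def deliver(self, seq, op):
        self.pending[seq] = op
        # Drain the buffer only while the next expected number is present.
        while self.applied in self.pending:
            self.state = self.pending.pop(self.applied)(self.state)
            self.applied += 1


leader = Leader()
r1, r2 = Replica(), Replica()
ops = [leader.order(lambda s: s + 5), leader.order(lambda s: s * 2)]

for seq, op in ops:            # r1 receives writes in order
    r1.deliver(seq, op)
for seq, op in reversed(ops):  # r2 receives them out of order
    r2.deliver(seq, op)

print(r1.state, r2.state)  # both replicas reach 10: (0 + 5) * 2
```

Non-commuting operations like these are exactly why the WRDT path needs a consensus round before applying, and why shrinking that round on the FPGA yields the large latency wins reported above.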

Limitations & Future Work

  • Hardware dependence – The performance boost hinges on having FPGA‑enabled NICs; environments without such hardware cannot reap the benefits.
  • Programming model – Developers must express their data‑type logic in a hardware‑friendly form (e.g., via high‑level synthesis, HLS), which adds a learning curve compared to pure‑software libraries.
  • Scalability beyond a single rack – The current prototype focuses on a single‑rack, 10‑GbE network; extending the design to multi‑rack, multi‑datacenter topologies may require additional protocol optimizations.
  • Consensus protocol flexibility – The built‑in consensus is tuned for WRDTs; supporting alternative protocols (e.g., Byzantine fault tolerance) would broaden applicability.

Future research directions include exposing higher‑level APIs for developers, integrating with popular distributed databases (e.g., CockroachDB, TiDB), and exploring hybrid CPU‑FPGA pipelines that dynamically shift workloads based on contention patterns.

Authors

  • Javad Saberlatibari
  • Prithviraj Yuvaraj
  • Mohsen Lesani
  • Philip Brisk
  • Mohammad Sadoghi

Paper Information

  • arXiv ID: 2603.08003v1
  • Categories: cs.DC
  • Published: March 9, 2026
