TIL: Byzantine Generals Problem in Real-World Distributed Systems
Source: Dev.to

Preface
Introductions to the Raft algorithm usually set Byzantine failures aside. Unexpectedly, Cloudflare's incident report from last November put a real-world Byzantine failure in its title. I'll use that report to organize some thoughts.
What is a Byzantine failure
In a distributed system, computers exchange messages to reach consensus: each node reports what it intends to do, or votes for a leader, and the other nodes confirm what they received.
If a computer tells one group of members (A) one thing and another group (B) something else, so that the group as a whole either cannot reach consensus or ends up in an unexpected state, that is called a Byzantine failure.
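To make the definition concrete, here is a minimal sketch of a node that equivocates. The node names, the two groups, and the "attack"/"retreat" values are all made up for illustration, and the honest nodes naively act on whatever they were told, which is exactly what a real Byzantine fault-tolerant protocol must avoid.

```python
def equivocating_sender(group_a, group_b):
    """A faulty node tells group A one thing and group B another."""
    messages = {name: "attack" for name in group_a}
    messages.update({name: "retreat" for name in group_b})
    return messages


# Hypothetical cluster: n1 and n2 form group A, n3 and n4 form group B.
received = equivocating_sender(group_a=["n1", "n2"], group_b=["n3", "n4"])

# Each honest node simply acts on the message it received.
decisions = dict(received)

# Consensus means every honest node decided on the same value.
consensus_reached = len(set(decisions.values())) == 1

print(decisions)
print("consensus reached:", consensus_reached)
```

Because the two groups received different values from the same sender, the set of decisions contains more than one value and the check reports that consensus was not reached.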
Many consensus algorithms, including Paxos and Raft, assume in their basic form that Byzantine failures do not occur, because tolerating them makes consensus considerably more complex.
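One way to see the extra cost is the well-known bound on cluster size. Crash-tolerant protocols such as Raft and Paxos need a majority of nodes alive, so tolerating f crashed nodes takes at least 2f + 1 nodes. The classic result for Byzantine agreement without signed messages is that tolerating f Byzantine nodes takes at least 3f + 1. The function names below are my own, not from any library.

```python
def min_nodes_crash_tolerant(f: int) -> int:
    """Smallest cluster that survives f crashed nodes (majority quorum)."""
    return 2 * f + 1


def min_nodes_byzantine_tolerant(f: int) -> int:
    """Smallest cluster that survives f Byzantine nodes (classic 3f + 1 bound)."""
    return 3 * f + 1


for f in (1, 2, 3):
    print(f, min_nodes_crash_tolerant(f), min_nodes_byzantine_tolerant(f))
```

So a three-node cluster, like the one in the incident described later, can tolerate one crashed node but cannot tolerate even one node that behaves in a Byzantine way.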
About Cloudflare's recovery mechanism
Before exploring the more complex issues, there is an interesting angle in Cloudflare's incident report: how they describe the redundancy built into their systems.
Service backup mechanism
- Each service runs on a set of rack-mounted servers.
- Each machine is connected to two switches.
- Each rack has two or more power supplies.
- Each server uses RAID 10 for storage redundancy (RAID 1 mirroring combined with RAID 0 striping).
- Each rack contains at least three machines.
The problem that occurred

Image explanation: Top left is Server 1, top right is Server 2, and below is Server 3, which is also the Leader.
- A network problem left Server 1 and Server 2 with inconsistent views of the cluster.
- Server 1 believed the Leader (Server 3) was offline.
- Server 2 believed the Leader was running normally.
- This inconsistency is why Cloudflare labeled the incident a Byzantine failure.
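The disagreement described above can be sketched as a reachability map. This is a simplified model, not Cloudflare's actual topology or the etcd implementation: the `can_reach` table and the server names are assumptions chosen to match the bullets (Server 1 cannot hear the Leader, Server 2 can). The election rule is the standard Raft behavior in which a follower that stops hearing from the leader times out and becomes a candidate.

```python
LEADER = "server3"

# Hypothetical view of the network: which peers each server can hear from.
can_reach = {
    "server1": {"server2"},             # lost contact with the leader
    "server2": {"server1", "server3"},  # still hears from everyone
    "server3": {"server2"},
}


def leader_view(node: str) -> str:
    """What a node believes about the leader, based only on its own links."""
    if node == LEADER:
        return "i am leader"
    return "leader alive" if LEADER in can_reach[node] else "leader offline"


def would_start_election(node: str) -> bool:
    """A Raft follower starts an election once it stops hearing the leader."""
    return leader_view(node) == "leader offline"


views = {node: leader_view(node) for node in can_reach}
followers_disagree = views["server1"] != views["server2"]

print(views)
print("followers disagree:", followers_disagree)
print("server1 starts election:", would_start_election("server1"))
```

No node is lying here. Each one reports honestly what it sees, but a partial network failure gives the followers conflicting pictures of the same leader, which produces the same kind of inconsistency as an equivocating node.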
Reference
- Cloudflare Dashboard and Cloudflare API service issues
- A Byzantine failure in the real world (Cloudflare blog)
- Raft does not Guarantee Liveness in the face of Network Faults
- Wiki: Byzantine Generals Problem (Chinese)
- Raft lecture (Raft user study) by Diego Ongaro
- The Cloudflare Blog
- Improving the Resiliency of Our Infrastructure DNS Zone
- Link aggregation – Wikipedia
- The power of the adversary
- Pull requests · etcd-io/etcd
- Understanding the Byzantine Generals’ Problem (Medium)
- Raft Consensus Algorithm