Artemis II fault tolerance

Published: May 1, 2026 at 01:39 PM EDT
3 min read

Source: Hacker News

Communications of the ACM published a fascinating post about how NASA built Artemis II’s fault‑tolerant computer. Below are three key excerpts.

Eight modules with several backup scenarios

Orion utilizes two Vehicle Management Computers, each containing two Flight Control Modules, for a total of four FCMs. But the redundancy goes even deeper: each FCM consists of a self‑checking pair of processors.
Effectively, eight CPUs run the flight software in parallel. The engineering philosophy hinges on a “fail‑silent” design. The self‑checking pairs ensure that if a CPU performs an erroneous calculation due to a radiation event, the error is detected immediately and the system responds.
“We can lose three FCMs in 22 seconds and still ride through safely on the last FCM,” said Uitenbroek. A silenced FCM doesn’t become dead weight, however; the system is designed to reset, re‑synchronize its state with the operating modules, and re‑join the group mid‑flight.
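
To make the fail-silent idea concrete, here is a minimal Python sketch of a self-checking pair. It is an illustration only, not NASA's implementation: the names `SelfCheckingPair`, `system_output`, `lane_a`, and `lane_b` are hypothetical, and the "reset" is a stand-in for the real reset and re-synchronization procedure.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SelfCheckingPair:
    """Hypothetical model of one FCM: two lanes that compute the same function."""
    lane_a: Callable[[float], float]
    lane_b: Callable[[float], float]
    silenced: bool = False

    def step(self, x: float) -> Optional[float]:
        """Return an output only when both lanes agree; otherwise go silent."""
        if self.silenced:
            return None
        a, b = self.lane_a(x), self.lane_b(x)
        if a != b:
            # Fail-silent: emit nothing rather than a possibly wrong value.
            self.silenced = True
            return None
        return a

    def reset(self) -> None:
        """Stand-in for reset and re-sync: the module may rejoin the group."""
        self.silenced = False


def system_output(modules: List[SelfCheckingPair], x: float) -> Optional[float]:
    """Step every module and return the first output from a non-silenced one."""
    outputs = [m.step(x) for m in modules]
    valid = [o for o in outputs if o is not None]
    return valid[0] if valid else None
```

In this sketch, a group of four modules still produces an output when three of them have a disagreeing lane, because the one healthy module keeps answering while the others stay quiet.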

Multiple redundancies with deterministic error‑checking

“This architecture ensures that each FCM sees the same inputs, runs the same application code, and produces the same outputs,” said Uitenbroek. Every second, the drift of any individual FCM is measured and its local clock is recalibrated to the network’s ‘true’ time. If an application fails to meet its strict deadline, the module is automatically silenced, reset, and re‑synchronized.
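
The drift measurement and deadline check described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: `FcmTiming`, `recalibrate`, and `enforce_deadline` are hypothetical names, times are integer microseconds, and the real system's synchronization protocol is certainly more involved.

```python
from dataclasses import dataclass


@dataclass
class FcmTiming:
    """Hypothetical timing state for one module (times in microseconds)."""
    local_us: int
    silenced: bool = False


def recalibrate(fcm: FcmTiming, network_us: int) -> int:
    """Measure drift against network time, then snap the local clock to it."""
    drift = fcm.local_us - network_us
    fcm.local_us = network_us
    return drift


def enforce_deadline(fcm: FcmTiming, elapsed_us: int, deadline_us: int) -> bool:
    """Silence the module if its application overran the deadline.

    Returns True while the module is still allowed to contribute outputs.
    """
    if elapsed_us > deadline_us:
        fcm.silenced = True
    return not fcm.silenced
```
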
The hardware itself is also reinforced. The system employs triple‑modular‑redundant memory that self‑corrects single‑bit errors on every read. Even the network interface cards utilize two lanes of traffic that are constantly compared, ensuring that a bit flip in the communication fabric results in a fail‑silent event rather than a corrupted command. The network itself is triple redundant with three separate planes, and all network switches employ self‑checking strategies.
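
As a rough illustration of how triple-modular-redundant memory can correct a single-bit error on read, the sketch below uses a bitwise majority vote across three copies of a word. Real TMR memory does this in hardware; `tmr_vote` and `TmrWord` are hypothetical names chosen for this example.

```python
from typing import List


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority: each output bit takes the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)


class TmrWord:
    """Hypothetical memory word stored as three copies and voted on every read."""

    def __init__(self, value: int) -> None:
        self.copies: List[int] = [value, value, value]

    def read(self) -> int:
        voted = tmr_vote(*self.copies)
        # Write the voted value back so a flipped bit does not linger.
        self.copies = [voted, voted, voted]
        return voted
```

If one copy suffers a bit flip, the other two outvote it and the read also repairs the damaged copy. The vote only fails if the same bit is wrong in two copies at once.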

Dissimilar redundancies

While the four‑FCM primary system is robust, NASA must still account for common‑mode failures—software bugs or catastrophic events that could theoretically impact all primary channels simultaneously.
To mitigate this, Orion carries a completely independent Backup Flight Software (BFS) system. This is a prime example of dissimilar redundancy. It is implemented on different hardware, runs a different operating system, and utilizes independently developed, simplified flight software.
Even in a total power loss scenario—called a “dead bus”—Orion is designed to survive. If power is restored, the spacecraft enters a safe mode, in which the vehicle first stabilizes itself and then points its solar arrays at the Sun to recover power. Then, it orients its tail toward the Sun for thermal stability before attempting to re‑establish communication with Earth. During such a failure, the crew can also take manual action to configure life‑support systems or don space suits.
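
The safe-mode recovery order described above can be pictured as a small state machine. This is only a sketch of the sequence as the excerpt describes it: the step names and the `next_step` helper are hypothetical, and the actual flight logic is not public in this level of detail.

```python
from enum import Enum, auto


class SafeModeStep(Enum):
    """Recovery steps in the order given in the excerpt (names are hypothetical)."""
    STABILIZE = auto()
    POINT_ARRAYS_AT_SUN = auto()
    TAIL_TO_SUN = auto()
    REESTABLISH_COMMS = auto()
    DONE = auto()


# Enum members iterate in definition order, which is the recovery order.
ORDER = list(SafeModeStep)


def next_step(current: SafeModeStep, step_succeeded: bool) -> SafeModeStep:
    """Advance only when the current step succeeded; otherwise stay and retry."""
    if current is SafeModeStep.DONE or not step_succeeded:
        return current
    return ORDER[ORDER.index(current) + 1]
```

The point of the strict ordering is that each step depends on the previous one: the vehicle cannot point its arrays before it is stable, and it needs power before it attempts to contact Earth.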

These redundancy strategies come at a high cost, but they illustrate valuable principles for designing fault‑tolerant systems in any high‑reliability domain.
