Artemis II Fault Tolerance
Source: Hacker News
Overview
Artemis II’s flight software runs on a highly redundant architecture designed to survive radiation‑induced errors, hardware failures, and even total power loss. The system emphasizes “fail‑silent” behavior, deterministic error checking, and dissimilar redundancy to guard against common‑mode failures.
Excerpt 1 – Eight CPUs with Multiple Backup Scenarios
“Orion utilizes two Vehicle Management Computers, each containing two Flight Control Modules, for a total of four FCMs. But the redundancy goes even deeper: each FCM consists of a self‑checking pair of processors. Effectively, eight CPUs run the flight software in parallel. The engineering philosophy hinges on a ‘fail‑silent’ design. The self‑checking pairs ensure that if a CPU performs an erroneous calculation due to a radiation event, the error is detected immediately and the system responds.
‘We can lose three FCMs in 22 seconds and still ride through safely on the last FCM,’ said Uitenbroek. A silenced FCM doesn’t become dead weight, however; the system is designed to reset, re‑synchronize its state with the operating modules, and re‑join the group mid‑flight.”
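The self‑checking pair and fail‑silent behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not Orion's implementation: the class `SelfCheckingPair`, the functions `control_law`, `upset`, and `first_valid_output`, and the rule of taking the first non‑silent module's output are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SelfCheckingPair:
    """Hypothetical model of one FCM: two processors run the same computation."""
    cpu_a: Callable[[int], int]
    cpu_b: Callable[[int], int]
    silenced: bool = False

    def step(self, sensor_input: int) -> Optional[int]:
        """Return an output only if both processors agree; otherwise go silent."""
        if self.silenced:
            return None
        out_a = self.cpu_a(sensor_input)
        out_b = self.cpu_b(sensor_input)
        if out_a != out_b:
            # Fail-silent: emit nothing rather than a possibly wrong value.
            self.silenced = True
            return None
        return out_a

    def reset_and_rejoin(self) -> None:
        """Stand-in for reset plus state re-synchronization with healthy modules."""
        self.silenced = False


def control_law(x: int) -> int:
    """Placeholder for the flight application (assumed, purely illustrative)."""
    return 2 * x + 1


def upset(x: int) -> int:
    """Same computation with one flipped bit, simulating a radiation event."""
    return control_law(x) ^ 0b100


def first_valid_output(fcms: List[SelfCheckingPair], sensor_input: int) -> Optional[int]:
    """Assumed selection rule: use the first module that is not silent."""
    for fcm in fcms:
        out = fcm.step(sensor_input)
        if out is not None:
            return out
    return None


# Three modules suffer an upset; the fourth still produces the correct output.
fcms = [SelfCheckingPair(control_law, upset) for _ in range(3)]
fcms.append(SelfCheckingPair(control_law, control_law))
result = first_valid_output(fcms, 10)
```

In this sketch a disagreement inside a pair never reaches the output. The pair simply stops talking, which is why losing three of the four modules still leaves a usable result.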
Excerpt 2 – Multiple Redundancies with Deterministic Error‑Checking
“This architecture ensures that each FCM sees the same inputs, runs the same application code, and produces the same outputs,” said Uitenbroek. Every second, the drift of any individual FCM is measured and its local clock is recalibrated to the network’s ‘true’ time. If an application fails to meet its strict deadline, the module is automatically silenced, reset, and re‑synchronized.
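As a rough illustration of the timing rules in this excerpt, the sketch below measures a module's drift against network time, recalibrates its clock, and silences the module when an application overruns its deadline. The `Module` class, its field names, the microsecond units, and the numeric values are assumptions made for the example. The source gives only the once‑per‑second recalibration and the silence, reset, re‑synchronize sequence.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """Hypothetical stand-in for one FCM's timing state."""
    name: str
    local_time_us: int
    silenced: bool = False

    def recalibrate(self, network_time_us: int) -> int:
        """Measure drift against network time, then snap the local clock to it."""
        drift_us = self.local_time_us - network_time_us
        self.local_time_us = network_time_us
        return drift_us

    def finish_frame(self, elapsed_us: int, deadline_us: int) -> bool:
        """Silence the module if the application overran its deadline."""
        if elapsed_us > deadline_us:
            self.silenced = True
        return not self.silenced

    def reset_and_resync(self, network_time_us: int) -> None:
        """Stand-in for reset followed by re-synchronization."""
        self.local_time_us = network_time_us
        self.silenced = False


fcm = Module(name="fcm-1", local_time_us=1_000_250)
drift = fcm.recalibrate(network_time_us=1_000_000)          # measured drift, then corrected
on_time = fcm.finish_frame(elapsed_us=900, deadline_us=1_000)
overrun = fcm.finish_frame(elapsed_us=1_200, deadline_us=1_000)
```

Treating a missed deadline the same way as a wrong answer keeps the rule simple: a module either delivers the right output on time or delivers nothing.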
The hardware itself is also reinforced. The system employs triple‑modular‑redundant memory that self‑corrects single‑bit errors on every read. Even the network interface cards utilize two lanes of traffic that are constantly compared, ensuring that a bit flip in the communication fabric results in a fail‑silent event rather than a corrupted command. The network itself is triple redundant with three separate planes, and all network switches employ self‑checking strategies.
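Two of the hardware mechanisms above lend themselves to a short sketch: majority voting across three memory copies, and comparing two network lanes before accepting a message. The bitwise‑majority formula is the standard way to vote three copies, but whether Orion's memory works exactly this way is not stated in the source. `TmrWord`, `majority`, and `dual_lane_receive` are hypothetical names.

```python
from typing import Optional


def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority vote: each output bit matches at least two inputs."""
    return (a & b) | (a & c) | (b & c)


class TmrWord:
    """Hypothetical memory word stored as three copies."""

    def __init__(self, value: int) -> None:
        self.copies = [value, value, value]

    def read(self) -> int:
        """Vote on every read and write the voted value back to all copies."""
        voted = majority(*self.copies)
        self.copies = [voted, voted, voted]
        return voted


def dual_lane_receive(lane_a: bytes, lane_b: bytes) -> Optional[bytes]:
    """Accept a message only if both lanes carried identical bytes."""
    return lane_a if lane_a == lane_b else None


word = TmrWord(0b1011_0010)
word.copies[1] ^= 0b0000_1000          # simulate a single-bit upset in one copy
value = word.read()                    # voted value; the bad copy is repaired

accepted = dual_lane_receive(b"\x01\x02", b"\x01\x02")
rejected = dual_lane_receive(b"\x01\x02", b"\x01\x03")  # mismatch: drop, do not guess
```

Note the difference between the two mechanisms: three copies allow correction, while two lanes only allow detection, which is why a lane mismatch leads to a fail‑silent event instead of a repaired message.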
Excerpt 3 – Dissimilar Redundancies
“While the four‑FCM primary system is robust, NASA must still account for common‑mode failures—software bugs or catastrophic events that could theoretically impact all primary channels simultaneously.”
To mitigate this, Orion carries a completely independent Backup Flight Software (BFS) system. This is a prime example of dissimilar redundancy: it is implemented on different hardware, runs a different operating system, and utilizes independently developed, simplified flight software.
Even in a total power loss scenario—called a “dead bus”—Orion is designed to survive. If power is restored, the spacecraft enters a safe mode, stabilizes, points its solar arrays at the Sun to recover power, orients its tail toward the Sun for thermal stability, and then attempts to re‑establish communication with Earth. During such a failure, the crew can manually configure life‑support systems or don space suits.
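The recovery order described above can be written down as an ordered sequence. The sketch below is only a way to make that order explicit: the `RecoveryStep` names and the assumption that each step must succeed before the next begins are illustrative and not taken from the source.

```python
from enum import Enum
from typing import Callable, List


class RecoveryStep(Enum):
    """Recovery steps in the order the article lists them (names are hypothetical)."""
    ENTER_SAFE_MODE = 1
    STABILIZE = 2
    POINT_ARRAYS_AT_SUN = 3
    ORIENT_TAIL_TO_SUN = 4
    REESTABLISH_COMMS = 5


def run_recovery(step_succeeds: Callable[[RecoveryStep], bool]) -> List[RecoveryStep]:
    """Run steps in order; stop at the first one that does not succeed (assumed gating)."""
    completed: List[RecoveryStep] = []
    for step in RecoveryStep:
        if not step_succeeds(step):
            break
        completed.append(step)
    return completed


# Example: everything succeeds except re-establishing communication.
done = run_recovery(lambda step: step is not RecoveryStep.REESTABLISH_COMMS)
```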
Takeaways
- Redundancy at multiple levels (processor, memory, networking) provides resilience against radiation‑induced errors.
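- Deterministic execution (same inputs, same code, same outputs), per‑second clock recalibration, and automatic silencing, resetting, and re‑synchronizing keep the FCMs in step with one another.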
- Dissimilar redundancy—different hardware and software stacks—protects against common‑mode failures.
While such extensive redundancy incurs significant cost, the Artemis II architecture offers valuable lessons for designing fault‑tolerant systems in other high‑reliability domains.