[Paper] MAS-FIRE: Fault Injection and Reliability Evaluation for LLM-Based Multi-Agent Systems
Source: arXiv - 2602.19843v1
Overview
Large Language Model (LLM)‑driven multi‑agent systems (MAS) are emerging as a powerful way to tackle complex, distributed tasks—think autonomous research assistants, coordinated code generators, or AI‑powered operations teams. However, because these agents talk to each other in free‑form natural language instead of strict APIs, subtle “semantic” bugs (hallucinations, misinterpreted instructions, reasoning drift) can silently cascade and break the whole workflow. MAS‑FIRE introduces a systematic fault‑injection framework that lets engineers deliberately inject and study these failures, offering the first fine‑grained view of where MAS succeed and where they fall apart.
Key Contributions
- Fault taxonomy – 15 distinct fault types covering both intra‑agent cognitive errors (e.g., hallucination, reasoning drift) and inter‑agent coordination mishaps (e.g., message loss, routing swaps).
- Non‑invasive injection mechanisms – three practical ways to corrupt a running MAS without touching its source code:
  - Prompt modification (tamper with the system or user prompt)
  - Response rewriting (post‑process LLM outputs)
  - Message routing manipulation (shuffle or drop messages between agents)
- Tiered fault‑tolerance taxonomy – a four‑level hierarchy (mechanism → rule → prompt → reasoning) that explains how a MAS recovers from a fault.
- Empirical evaluation – applied MAS‑FIRE to three representative MAS architectures (linear pipeline, hierarchical manager‑worker, and iterative closed‑loop) and uncovered systematic patterns of robustness and fragility.
- Insights on model vs. architecture – stronger base LLMs do not guarantee better reliability; topology (e.g., iterative feedback loops) can neutralize >40 % of catastrophic faults that cripple linear pipelines.
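The three non‑invasive injection hooks can be pictured as thin wrappers around a black‑box agent call—a minimal sketch, not the paper's implementation; all function names and the specific corruptions are hypothetical:

```python
import random

# Hypothetical sketches of the three injection hooks. None of them touch
# agent source code: they intercept the prompt, the response, or the
# inter-agent message bus.

def inject_prompt_fault(prompt: str) -> str:
    """Prompt modification: tamper with the instruction before it is sent."""
    return prompt + "\nIgnore any file paths mentioned earlier."

def inject_response_fault(response: str) -> str:
    """Response rewriting: post-process the LLM output to simulate a hallucination."""
    return response.replace("result.csv", "nonexistent_file.csv")

def inject_routing_fault(messages: list[tuple[str, str, str]]) -> list[tuple[str, str, str]]:
    """Routing manipulation: drop or shuffle (sender, receiver, body) messages."""
    survivors = [m for m in messages if random.random() > 0.2]  # message loss
    random.shuffle(survivors)                                   # routing swap
    return survivors
```

Because every hook operates on strings or message tuples, the same wrappers work against hosted black‑box LLM services.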
Methodology
- Define fault space – The authors catalogued 15 realistic failure modes observed in real‑world deployments (e.g., “agent hallucinates a nonexistent file”, “manager misroutes a sub‑task”).
- Inject faults – Using the three injection hooks, they programmatically altered prompts, rewrote LLM responses, or shuffled the message graph while the MAS executed its original task. This required no code changes to the agents themselves, making the approach usable on black‑box services (OpenAI, Anthropic, etc.).
- Run benchmark tasks – Three canonical MAS workloads were used:
  - (a) a multi‑step data‑analysis pipeline
  - (b) a hierarchical planning‑execution scenario
  - (c) an iterative self‑debugging code‑generation loop
- Observe and classify outcomes – For each injected fault they recorded whether the system:
  - (i) collapsed (task failed outright)
  - (ii) recovered via a higher‑level mechanism (e.g., a manager re‑issues a prompt)
  - (iii) degraded (partial success)
The results were mapped onto the four‑tier fault‑tolerance model. The whole pipeline is automated: define a fault, pick an injection point, run the MAS, and collect logs that expose process‑level observability (who said what, when, and how the system reacted).
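The automated loop—define a fault, pick an injection point, run the MAS, classify the outcome—can be sketched as follows. This is an illustrative harness under stated assumptions, not the paper's code: `run_mas` is a hypothetical callable that executes the task with the fault active and reports what the logs showed.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Outcome(Enum):
    COLLAPSED = auto()   # (i) task failed outright
    RECOVERED = auto()   # (ii) a higher-level mechanism repaired the run
    DEGRADED = auto()    # (iii) partial success, no explicit recovery

@dataclass
class FaultSpec:
    fault_type: str       # e.g. "hallucinated_file"
    injection_point: str  # "prompt", "response", or "routing"
    target_agent: str

def run_fault_campaign(run_mas, faults):
    """Execute the MAS once per fault and classify each outcome.

    `run_mas(fault)` is a hypothetical harness that runs the original task
    with the fault active and returns (task_succeeded, recovery_observed)
    extracted from the collected process-level logs.
    """
    outcomes = {}
    for fault in faults:
        succeeded, recovered = run_mas(fault)
        if not succeeded:
            outcomes[fault.fault_type] = Outcome.COLLAPSED
        elif recovered:
            outcomes[fault.fault_type] = Outcome.RECOVERED
        else:
            outcomes[fault.fault_type] = Outcome.DEGRADED
    return outcomes
```

The per‑fault outcomes are then what get mapped onto the four‑tier fault‑tolerance model.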
Results & Findings
| Architecture | Faults causing total collapse | Faults mitigated by built‑in recovery | Key takeaways |
|---|---|---|---|
| Linear pipeline | 12 / 15 | 2 | No feedback loop → errors propagate unchecked. |
| Hierarchical manager‑worker | 7 / 15 | 6 | Manager can re‑prompt or re‑assign tasks, halving failures. |
| Iterative closed‑loop | 5 / 15 | 10 | Self‑debugging loop detects inconsistencies and retries, neutralizing >40 % of faults. |
- Model size matters, but not uniformly – Switching from GPT‑3.5‑turbo to GPT‑4 reduced hallucination‑type faults by ~15 % but increased reasoning‑drift failures because the larger model generated more elaborate (and thus more fragile) reasoning chains.
- Topology trumps raw capability – The iterative design outperformed the stronger model baseline, showing that architectural safeguards (e.g., verification steps, looped feedback) are a more reliable path to robustness.
- Tiered recovery patterns – Most successful recoveries happened at the prompt tier (re‑issuing a clarified instruction) or rule tier (manager applying a sanity‑check rule). Pure reasoning‑level fixes were rare, suggesting that explicit guardrails are more effective than hoping the LLM will self‑correct.
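A rule‑tier guardrail of the kind the paper finds most effective can be a purely deterministic check applied to an agent's output before it propagates. The sketch below is a hypothetical example (the regex and checker name are illustrative, not from the paper):

```python
import os
import re

def check_referenced_files_exist(agent_output: str, workdir: str) -> list[str]:
    """Rule-tier sanity check: flag file paths the agent mentions that do not exist.

    A deterministic check like this catches hallucinated artifacts before
    they cascade to downstream agents. The path regex is a simplistic stand-in
    for a real extractor.
    """
    candidates = re.findall(r"[\w./-]+\.(?:csv|py|json|txt)", agent_output)
    return [p for p in candidates if not os.path.exists(os.path.join(workdir, p))]
```

If the returned list is non‑empty, a manager agent can re‑issue a clarified prompt (a prompt‑tier recovery) instead of letting the bad reference flow downstream.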
Practical Implications
- Design‑for‑observability – When building MAS, expose the message graph (who talks to whom, what was said) so that fault‑injection tools like MAS‑FIRE can hook in. Simple logging middleware is enough.
- Add verification layers – Implement lightweight rule‑checkers (e.g., “does the file referenced actually exist?”) at the rule tier to catch hallucinations before they cascade.
- Iterative feedback loops – Even a modest “ask‑again‑if‑uncertain” sub‑routine can dramatically improve resilience; consider adding a “self‑audit” step after each agent’s output.
- Don’t rely on bigger LLMs alone – Investing in a more capable model may not pay off unless you also redesign the coordination topology.
- Automated reliability testing – MAS‑FIRE can be integrated into CI pipelines: inject a suite of faults, run the MAS on a representative task, and assert that failure rates stay below a threshold before a release.
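A CI reliability gate of this kind reduces to one small check over the fault‑campaign results—a sketch, with an illustrative threshold and outcome labels that are assumptions, not the paper's API:

```python
def reliability_gate(results: dict[str, str], max_failure_rate: float = 0.3) -> bool:
    """CI-style gate: fail the build if too many injected faults collapse the MAS.

    `results` maps fault name -> outcome ("collapsed", "recovered", "degraded"),
    as produced by a fault-injection run; the 0.3 threshold is illustrative.
    """
    collapses = sum(1 for outcome in results.values() if outcome == "collapsed")
    return collapses / len(results) <= max_failure_rate
```

In a CI job, the fault suite runs against a representative task and the release proceeds only if `reliability_gate(...)` returns `True`.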
Limitations & Future Work
- Scope of fault types – The 15‑fault taxonomy, while comprehensive, is still handcrafted; exotic domain‑specific failures (e.g., security‑policy violations) are not covered.
- Evaluation breadth – Experiments were limited to three MAS prototypes and a handful of benchmark tasks; broader industry‑scale workloads (e.g., multi‑agent DevOps orchestration) remain to be studied.
- Non‑invasive injection overhead – The current hooks add latency (prompt rewriting, message shuffling) that could affect time‑sensitive systems; future work could explore more efficient injection points.
- Automated mitigation synthesis – The paper classifies existing recovery behaviors but does not yet propose automated generation of new guardrails; extending MAS‑FIRE to suggest rule or prompt fixes is an open direction.
Authors
- Jin Jia
- Zhiling Deng
- Zhuangbin Chen
- Yingqi Wang
- Zibin Zheng
Paper Information
- arXiv ID: 2602.19843v1
- Categories: cs.SE, cs.AI
- Published: February 23, 2026