[Paper] 'Where is My Troubleshooting Procedure?': Studying the Potential of RAG in Assisting Failure Resolution of Large Cyber-Physical System
Source: arXiv - 2601.08706v1
Overview
The paper investigates how Retrieval‑Augmented Generation (RAG) can be turned into a conversational assistant that helps operators quickly locate the right troubleshooting procedure in the massive, natural‑language manuals of large cyber‑physical systems (CPS). Using real‑world data from Fincantieri’s naval platforms, the authors show that a RAG‑based tool can substantially cut the time needed to find the relevant steps, while also highlighting the need for safeguards before any recommendation is executed.
Key Contributions
- Empirical study on RAG for CPS troubleshooting – first large‑scale evaluation on industrial naval manuals containing thousands of procedures.
- Design of a hybrid retrieval‑generation pipeline that combines dense vector search with a fine‑tuned language model to produce concise, context‑aware answers.
- User‑centric evaluation involving actual operators, measuring speed, accuracy, and perceived usefulness of the assistant.
- Guidelines for safe deployment, including cross‑validation mechanisms and confidence‑threshold heuristics to avoid blind execution of generated steps.
- Open dataset & benchmark (anonymized excerpts of the manuals) released for the research community to reproduce and extend the experiments.
Methodology
- Data preparation – The authors extracted 3,412 troubleshooting procedures from Fincantieri’s documentation, cleaned the text, and segmented it into procedure‑level chunks.
- Retrieval layer – A dense embedding model (based on SBERT) indexed the chunks, enabling fast similarity search given a symptom description.
- Generation layer – A GPT‑style decoder was fine‑tuned on a subset of the manual to rewrite retrieved snippets into concise, step‑by‑step instructions tailored to the operator’s query.
- Safety wrapper – Before presenting an answer, the system runs a rule‑based validator that checks for critical actions (e.g., power‑off, valve changes) against a whitelist and flags low‑confidence outputs.
- Evaluation – Two experiments were conducted: (a) offline metrics (Recall@k, BLEU, factual consistency) and (b) an online user study with 12 seasoned operators who solved simulated fault scenarios using either the RAG assistant or traditional manual search.
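The retrieval layer and safety wrapper described above can be sketched roughly as follows. This is a minimal illustration, not the paper’s implementation: the embedding function, the chunk corpus, the critical‑action terms, and the whitelist are all made‑up placeholders (in the paper, an SBERT‑style model produces the embeddings).

```python
import hashlib
import numpy as np

# Hypothetical procedure-level chunks standing in for the indexed manual text.
CHUNKS = [
    "Check the bilge pump fuse before restarting the unit.",
    "Power off the main switchboard, then inspect breaker B12.",
    "Close valve V3 and vent the line before replacing the seal.",
]
EMBED_DIM = 64

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: a deterministic hash-seeded random vector."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).normal(size=EMBED_DIM)

chunk_vecs = np.stack([embed(c) for c in CHUNKS])

def retrieve(query: str, k: int = 2) -> list[tuple[float, str]]:
    """Dense retrieval: rank chunks by cosine similarity to the query."""
    q = embed(query)
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    top = np.argsort(sims)[::-1][:k]
    return [(float(sims[i]), CHUNKS[i]) for i in top]

# Rule-based safety check: flag steps mentioning critical actions unless they
# appear in an approved whitelist (both lists are invented for this sketch).
CRITICAL_TERMS = ("power off", "close valve")
WHITELIST = {"Close valve V3 and vent the line before replacing the seal."}

def validate(step: str) -> str:
    if any(t in step.lower() for t in CRITICAL_TERMS) and step not in WHITELIST:
        return "FLAGGED"
    return "OK"

for score, step in retrieve("pump will not restart"):
    print(f"{validate(step):7s} {score:+.2f}  {step}")
```

In the paper, flagged or low‑confidence outputs are surfaced to the operator rather than silently executed; the sketch only shows where that decision point sits in the pipeline.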
Results & Findings
| Metric | Traditional Search | RAG Assistant |
|---|---|---|
| Avg. time to first relevant step (seconds) | 112 ± 23 | 38 ± 12 |
| Correctness of selected procedure (% of cases) | 71% | 84% |
| Operator confidence (1‑5 Likert) | 3.2 | 4.4 |
| False‑positive recommendations (critical actions) | 0% (manual) | 2.3% (filtered) |
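Recall@k, one of the offline metrics used in the evaluation, measures how often a relevant procedure appears among the top‑k retrieved chunks. A minimal sketch, with hypothetical procedure IDs:

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k of the ranked list."""
    top_k = set(ranked_ids[:k])
    hits = sum(1 for r in relevant_ids if r in top_k)
    return hits / len(relevant_ids)

# Toy example: procedure P7 is the single relevant document for this query.
ranking = ["P3", "P7", "P1", "P9"]
print(recall_at_k(ranking, {"P7"}, k=1))  # 0.0 — P7 is not ranked first
print(recall_at_k(ranking, {"P7"}, k=3))  # 1.0 — P7 is within the top 3
```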
Key Takeaways
- The RAG tool reduced the “search‑and‑identify” phase by roughly 65% (112 s → 38 s on average), a major win in time‑critical incidents.
- Accuracy improved, but a small fraction of generated answers still suggested unsafe actions, underscoring the importance of the validation layer.
- Operators reported that the conversational interface lowered cognitive load and made it easier to ask follow‑up “what if” questions.
Practical Implications
- Faster incident response – Deploying a RAG‑powered assistant in control rooms can shave minutes off fault diagnosis, potentially preventing costly downtime in shipyards, power plants, or manufacturing lines.
- Reduced training overhead – New engineers can rely on the assistant to navigate legacy documentation without memorizing every procedure.
- Integration pathways – The architecture can be wrapped around existing CMMS/SCADA systems via APIs, enabling seamless hand‑off from chatbot to execution platforms.
- Safety‑first deployment – The paper’s validation hooks (rule‑based checks, confidence thresholds) provide a blueprint for building “human‑in‑the‑loop” safeguards that satisfy regulatory standards.
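A confidence‑threshold gate of the kind the paper recommends could be wired in front of the operator as sketched below; the threshold value, the `Answer` type, and the routing labels are assumptions for illustration, not part of the paper’s system.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    retrieval_score: float  # e.g. max cosine similarity of supporting chunks

# Hypothetical threshold: answers scoring below it require explicit human
# review before any step may be executed.
CONFIDENCE_THRESHOLD = 0.75

def route(answer: Answer) -> str:
    """Human-in-the-loop gate: present confident answers, escalate the rest."""
    if answer.retrieval_score >= CONFIDENCE_THRESHOLD:
        return "present-to-operator"
    return "escalate-for-review"

print(route(Answer("Replace fuse F2, then restart the pump.", 0.91)))
print(route(Answer("Possibly open valve V9 (weak match).", 0.42)))
```

Keeping the gate outside the language model (as a plain, auditable rule) is what makes it easy to argue about with regulators, which is the design choice the paper’s safeguards lean on.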
Limitations & Future Work
- Domain specificity – The study focuses on naval CPS; results may differ for other sectors with distinct vocabularies or procedural structures.
- Limited multilingual support – Manuals were Italian‑centric; extending to multilingual corpora will require additional language models.
- Scalability of validation – Rule‑based cross‑checks work for a known set of critical actions but may struggle with novel procedures; future work could explore automated formal verification or reinforcement‑learning‑based safety nets.
- User study size – Only 12 operators participated; larger field trials are needed to confirm long‑term adoption and impact on real incidents.
Bottom line: RAG shows strong promise as a “smart search” layer for massive troubleshooting manuals, offering tangible speed and accuracy gains while reminding us that safety‑critical domains still demand rigorous validation before letting AI take the wheel.
Authors
- Maria Teresa Rossi
- Leonardo Mariani
- Oliviero Riganelli
- Giuseppe Filomento
- Danilo Giannone
- Paolo Gavazzo
Paper Information
- arXiv ID: 2601.08706v1
- Categories: cs.SE
- Published: January 13, 2026