[Paper] Using Large Language Models and Knowledge Graphs to Improve the Interpretability of Machine Learning Models in Manufacturing
Source: arXiv - 2604.16280v1
Overview
The paper proposes a hybrid XAI pipeline that couples a domain‑specific Knowledge Graph (KG) with a Large Language Model (LLM) to turn raw machine‑learning outputs into clear, context‑aware explanations for manufacturing operators. By storing both sensor‑level data and the model’s predictions in a structured graph, the system can retrieve the most relevant facts and let an LLM describe them in plain language, substantially improving interpretability in a real‑world factory setting.
Key Contributions
- KG‑augmented XAI framework: Introduces a method for persisting ML results together with domain knowledge as graph triples, creating a unified source of truth for explanations.
- Selective KG retrieval for LLM prompting: Designs a lightweight retrieval algorithm that extracts only the most pertinent triples before feeding them to an LLM, keeping prompts short and cost‑effective.
- Manufacturing‑focused evaluation: Uses questions from the XAI Question Bank plus 15 custom, industry‑specific questions (33 in total) to benchmark the approach on accuracy, consistency, clarity, and usefulness.
- Empirical evidence of decision‑support gains: Shows that explanations generated through the KG‑LLM pipeline lead to higher operator confidence and faster root‑cause analysis in a pilot production line.
- Open‑source reference implementation: Provides code and a small KG schema that can be adapted to other manufacturing domains or even other industrial sectors.
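The idea of persisting ML results together with domain knowledge as graph triples can be sketched as follows. This is a minimal illustration, not the paper's released schema: the node naming, predicate names, and helper function are assumptions.

```python
# Hypothetical sketch: persist one prediction and its top SHAP-attributed
# features as graph triples (schema and names are illustrative only).
def prediction_to_triples(machine_id, prediction, shap_values, top_n=3):
    """Turn one model output plus its feature attributions into triples."""
    pred_node = f"prediction/{machine_id}/{prediction['timestamp']}"
    triples = [
        (pred_node, "predicts", prediction["label"]),
        (pred_node, "about_machine", machine_id),
        (pred_node, "confidence", str(prediction["score"])),
    ]
    # Link the features that contributed most, ranked by |SHAP value|
    ranked = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)
    for feature, value in ranked[:top_n]:
        triples.append((pred_node, "influenced_by", f"{feature} (shap={value:+.2f})"))
    return triples

# Example: a failure prediction linked to three contributing features
triples = prediction_to_triples(
    "machine_X",
    {"timestamp": 1700000000, "label": "failure_within_1h", "score": 0.87},
    {"temperature": 0.42, "vibration": -0.05, "spindle_load": 0.31},
)
```

Each triple could then be written to any graph store (the paper uses a Neo4j‑style KG), giving every explanation a traceable link back to the prediction and the raw features that drove it.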
Methodology
- Data & Model Integration – Sensor streams, process parameters, and the outputs of a predictive maintenance model are ingested into a Neo4j‑style KG. Each prediction is linked to the raw features that contributed most (e.g., via SHAP values).
- Triplet Selection – For a given user query, a rule‑based selector (feature importance > threshold + temporal relevance) pulls a handful of triples (typically 5‑10) that capture the “why” behind the prediction.
- Prompt Construction – The selected triples are formatted as natural‑language statements (e.g., “Machine X showed a temperature rise of +12 °C in the last 30 min”) and concatenated with a concise instruction for the LLM (e.g., “Explain why the model predicts a failure in the next hour”).
- LLM Generation – A commercial LLM (GPT‑4‑Turbo) processes the prompt and returns a paragraph‑long explanation. Post‑processing removes jargon and adds actionable recommendations.
- Evaluation – Answers are compared against a ground‑truth set from the XAI Question Bank. Quantitative metrics (accuracy, consistency) are computed automatically, while a panel of 8 manufacturing engineers rates clarity and usefulness on a 5‑point Likert scale.
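The triplet‑selection and prompt‑construction steps above can be sketched as a small pipeline. This is a hedged reconstruction, not the paper's implementation: the triple structure, threshold values, time window, and helper names are all assumptions made for illustration.

```python
from dataclasses import dataclass
import time

@dataclass
class Triple:
    subject: str
    predicate: str
    obj: str
    importance: float   # e.g., a SHAP-derived score for the linked feature
    timestamp: float    # Unix time of the underlying observation

def select_triples(triples, importance_threshold=0.1,
                   max_age_s=1800, top_k=10, now=None):
    """Rule-based selector: keep triples whose linked feature importance
    exceeds a threshold and whose observation is recent, then take the
    top-k by importance (the paper reports 5-10 triples per query)."""
    now = time.time() if now is None else now
    candidates = [t for t in triples
                  if t.importance > importance_threshold
                  and (now - t.timestamp) <= max_age_s]
    return sorted(candidates, key=lambda t: t.importance, reverse=True)[:top_k]

def build_prompt(selected, question):
    """Verbalize each triple as a plain statement and append the instruction."""
    facts = "\n".join(f"- {t.subject} {t.predicate} {t.obj}." for t in selected)
    return (f"Known facts from the plant knowledge graph:\n{facts}\n\n"
            f"Instruction: {question}")

# Example usage with made-up sensor facts
now = 1_700_000_000.0
triples = [
    Triple("Machine X", "showed a temperature rise of", "+12 °C in 30 min", 0.42, now - 600),
    Triple("Machine X", "has vibration RMS", "within the normal band", 0.05, now - 300),
    Triple("Machine X", "has spindle load at", "98 % of rated maximum", 0.31, now - 7200),
]
selected = select_triples(triples, now=now)
prompt = build_prompt(selected, "Explain why the model predicts a failure in the next hour.")
```

Here the low‑importance vibration triple and the stale spindle‑load triple are filtered out, so only the temperature fact reaches the LLM, which keeps prompts short and token costs low.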
Results & Findings
| Metric | KG‑LLM (proposed) | Baseline LLM‑only | Baseline SHAP‑text |
|---|---|---|---|
| Accuracy (correct answer) | 92 % | 71 % | 68 % |
| Consistency (same answer on re‑ask) | 94 % | 78 % | 75 % |
| Clarity (avg. rating) | 4.6 / 5 | 3.8 / 5 | 3.5 / 5 |
| Usefulness (avg. rating) | 4.5 / 5 | 3.6 / 5 | 3.2 / 5 |
- The KG‑LLM pipeline answered all 33 questions with higher factual correctness than a vanilla LLM that only sees raw feature values.
- Operators reported a 30 % reduction in time to diagnose a predicted failure, attributing the speed‑up to the contextual grounding provided by the KG.
- Qualitative feedback highlighted that the explanations felt “grounded in the plant’s own language” and avoided the “black‑box feel” of typical SHAP plots.
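The accuracy and consistency rows in the table can be computed with simple counting. The sketch below shows one plausible reading (exact‑match accuracy, and consistency as "all re‑asks agree"); the paper's exact scoring protocol may differ.

```python
def accuracy(predicted, ground_truth):
    """Share of questions answered correctly, using exact match against
    the ground-truth set (a simplification of the paper's protocol)."""
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return correct / len(ground_truth)

def consistency(reasked_answers):
    """Share of questions for which every re-ask produced the same answer."""
    stable = sum(1 for answers in reasked_answers if len(set(answers)) == 1)
    return stable / len(reasked_answers)

# Toy example: 3 questions, each asked twice
acc = accuracy(["bearing wear", "overheat", "ok"],
               ["bearing wear", "overheat", "misalignment"])
cons = consistency([["bearing wear", "bearing wear"],
                    ["overheat", "overheat"],
                    ["ok", "misalignment"]])
```

Clarity and usefulness, by contrast, come from the 8‑engineer panel's Likert ratings and cannot be automated this way.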
Practical Implications
- Faster root‑cause analysis: Maintenance teams can act on predictions without digging through raw sensor logs or separate SHAP visualizations.
- Lower training overhead: New operators can understand model outputs through natural language, reducing the need for specialized XAI training.
- Scalable to other domains: The retrieval‑prompt pattern works with any LLM and any graph database, making it a reusable component for quality control, energy management, or supply‑chain forecasting.
- Cost‑effective deployment: By limiting the prompt to a handful of triples, token usage stays low (≈ 150 tokens per query), keeping API costs manageable even in high‑throughput environments.
- Compliance & auditability: Storing explanations as linked graph triples creates a traceable provenance trail, useful for regulatory reporting in safety‑critical manufacturing.
Limitations & Future Work
- KG maintenance burden: Keeping the graph up‑to‑date with evolving sensor suites and process changes requires dedicated data‑engineering effort.
- Domain‑specific prompt engineering: The current selector and prompt templates were hand‑tuned for a single factory; generalizing them may need automated prompt‑optimization techniques.
- LLM hallucination risk: Although the KG grounding reduces hallucinations, occasional fabrications were observed when the retrieved triples were sparse.
- Future directions include: (1) learning‑based retrieval models that adapt to user feedback, (2) integrating multimodal data (e.g., images from inspection cameras) into the KG, and (3) evaluating the approach on larger, multi‑plant deployments to test scalability and robustness.
Authors
- Thomas Bayer
- Alexander Lohr
- Sarah Weiß
- Bernd Michelberger
- Wolfram Höpken
Paper Information
- arXiv ID: 2604.16280v1
- Categories: cs.AI
- Published: April 17, 2026