[Paper] Using Large Language Models to Support Automation of Failure Management in CI/CD Pipelines: A Case Study in SAP HANA
Source: arXiv - 2602.06709v1
Overview
The paper investigates whether large language models (LLMs) can reliably automate the detection and fixing of CI/CD pipeline failures in a real‑world, enterprise‑scale project—SAP HANA. By supplying the LLM with several kinds of domain knowledge, the authors show that it can pinpoint the failing component and suggest precise, actionable fixes far more reliably than a model given only the raw failure log.
Key Contributions
- LLM‑driven failure management prototype for a production‑grade CI/CD pipeline (SAP HANA).
- Systematic evaluation of three knowledge sources: pipeline metadata, explicit failure‑management instructions, and a repository of historical failure cases.
- Ablation study quantifying the impact of each knowledge source on location‑identification and solution‑generation accuracy.
- Empirical results demonstrating 97.4 % error‑location accuracy (vs. 84.2 % without domain knowledge) and 92.1 % exact‑solution rate when historical failure data are included.
- Practical guidelines for integrating LLMs into existing DevOps toolchains.
Methodology
- Data Collection – The authors extracted 1,200 CI/CD failure instances from SAP HANA’s build pipeline, each annotated with:
- The failing step (error location).
- A human‑written remediation instruction.
- Contextual metadata (e.g., affected module, test suite).
- Knowledge Injection – Three “knowledge packs” were prepared:
- Pipeline Info – Structured data about the CI/CD stages and artifact dependencies.
- Management Instructions – A curated set of rule‑based guidelines used by SAP engineers.
- Historical Failures – A searchable archive of past failure logs and their resolved solutions.
- LLM Prompt Engineering – A state‑of‑the‑art LLM (GPT‑4‑style) was prompted with the failure log plus one or more knowledge packs. The prompt asked the model to (a) locate the error and (b) output a minimal, executable fix.
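The prompt-assembly step can be pictured as simple concatenation of the failure log, the selected knowledge packs, and the two-part task. This is a minimal sketch under that assumption; the function name, section markers, and pack contents are illustrative, not taken from the paper.

```python
# Illustrative sketch of assembling a failure-management prompt from a
# failure log plus optional "knowledge packs" (all names hypothetical).

def build_prompt(failure_log: str, packs: dict[str, str]) -> str:
    """Concatenate the failure log with any provided knowledge packs."""
    sections = [f"### Failure log\n{failure_log}"]
    for name, content in packs.items():
        sections.append(f"### {name}\n{content}")
    sections.append(
        "### Task\n"
        "(a) Identify the failing pipeline stage.\n"
        "(b) Output a minimal, executable fix with no extra steps."
    )
    return "\n\n".join(sections)

prompt = build_prompt(
    "step 'unit-tests' exited with code 1: FAIL test_persistence",
    {"Pipeline Info": "stages: build -> unit-tests -> integration -> deploy"},
)
```

Leaving `packs` empty reproduces the baseline (no-knowledge) configuration, so the same function covers every ablation arm.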
- Ablation Experiments – The system was run under five configurations:
- No external knowledge (baseline).
- Only pipeline info.
- Only management instructions.
- Only historical failures.
- All three combined.
- Evaluation Metrics –
- Location Accuracy – Correct identification of the failing pipeline stage.
- Solution Exactness – Whether the suggested fix matches the human‑validated solution without superfluous steps.
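Both metrics reduce to exact-match rates over the annotated cases. The sketch below assumes a simple record layout with predicted and ground-truth fields; the field names are hypothetical, not the paper's own schema.

```python
# Hypothetical computation of the two evaluation metrics over annotated
# failure cases (the record structure is assumed, not from the paper).

def evaluate(cases: list[dict]) -> tuple[float, float]:
    """Return (location accuracy, exact-solution rate) as fractions."""
    loc_hits = sum(c["predicted_stage"] == c["true_stage"] for c in cases)
    fix_hits = sum(c["predicted_fix"] == c["true_fix"] for c in cases)
    n = len(cases)
    return loc_hits / n, fix_hits / n

cases = [
    {"predicted_stage": "build", "true_stage": "build",
     "predicted_fix": "rerun with -j1", "true_fix": "rerun with -j1"},
    {"predicted_stage": "test", "true_stage": "deploy",
     "predicted_fix": "clear cache", "true_fix": "bump dependency"},
]
loc_acc, sol_rate = evaluate(cases)  # → (0.5, 0.5)
```

Note that Solution Exactness is strict: a fix containing superfluous steps counts as a miss even if it would resolve the failure.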
Results & Findings
| Configuration | Error‑Location Accuracy | Exact‑Solution Rate |
|---|---|---|
| Baseline (no knowledge) | 84.2 % | 68.5 % |
| Pipeline info only | 89.1 % | 75.3 % |
| Management instructions only | 90.4 % | 78.9 % |
| Historical failures only | 97.4 % | 92.1 % |
| All knowledge packs combined | 96.8 % | 91.4 % |
- Historical failure data dominate: They provide concrete patterns that the LLM can match, dramatically boosting both location and solution accuracy.
- Marginal gains from combining all sources suggest diminishing returns once a rich failure archive is available.
- The LLM consistently generated minimal fixes—no extra steps, no “best‑practice” fluff—making the output ready for automated execution.
Practical Implications
- Automated Triage Bots: Teams can embed an LLM‑powered assistant into their CI/CD dashboards to instantly surface the root cause and a ready‑to‑run fix, cutting mean‑time‑to‑recovery (MTTR) by minutes or hours.
- Knowledge‑Base Leverage: Companies that already maintain a searchable log of past build failures can unlock immediate ROI by feeding that archive to an LLM, rather than building custom rule engines.
- Scalable DevOps: The approach scales with the size of the failure archive; as more incidents are logged, the model’s precision improves, creating a virtuous feedback loop.
- Integration Simplicity: Because the solution is generated as plain text commands or configuration snippets, it can be piped directly into existing automation tools (e.g., Jenkins, GitHub Actions) without extensive API work.
- Reduced On‑Call Fatigue: Junior engineers or on‑call staff can rely on the assistant for first‑line diagnostics, freeing senior staff for higher‑impact work.
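Because the fixes arrive as plain-text commands, a triage bot can gate them through a sandbox before anything touches the pipeline—a guardrail that also addresses the hallucination risk noted below. This is a minimal sketch; the dry-run mechanism and the example command are assumptions, not from the paper.

```python
# Illustrative guardrail: run an LLM-suggested fix in "sandbox" mode
# (here, merely echoing it) before executing it for real. The sandbox
# mechanism and example command are hypothetical.
import subprocess

def apply_fix(fix_cmd: str, sandbox: bool = True) -> bool:
    """Execute the suggested fix; in sandbox mode, only echo it."""
    cmd = ["echo", fix_cmd] if sandbox else ["sh", "-c", fix_cmd]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0

ok = apply_fix("git clean -fdx && ./build.sh --target unit-tests")
```

A production bot would replace the echo stand-in with a throwaway container or CI dry-run stage, and only promote the fix after the sandbox run succeeds.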
Limitations & Future Work
- Domain Specificity: The study focuses on SAP HANA; results may differ for pipelines with less structured logs or for languages/frameworks not represented in the historical archive.
- Model Hallucination Risk: Although exactness was high, occasional hallucinated commands could still cause failures; a verification step (e.g., sandbox execution) is advisable.
- Knowledge Maintenance: Keeping the historical failure repository up‑to‑date requires disciplined logging and curation—an operational overhead not covered in the paper.
- Scalability of Prompt Size: Very large knowledge packs may exceed token limits of current LLM APIs; future work could explore retrieval‑augmented generation or vector‑based similarity search to keep prompts concise.
- Broader Metrics: The authors measured accuracy but not downstream business impact (e.g., cost savings, developer satisfaction). Extending evaluation to these dimensions is a natural next step.
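The prompt-size limitation above points toward retrieving only the most relevant historical failures rather than inlining the whole archive. As a stand-in for the vector-based similarity search the authors propose as future work, this sketch uses Jaccard token overlap; the archive entries and function names are illustrative.

```python
# Minimal retrieval sketch: select the k archived failure cases most
# similar to a new failure log, using Jaccard token overlap as a simple
# stand-in for vector similarity search (all data here is made up).

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two log strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def top_k(log: str, archive: list[str], k: int = 3) -> list[str]:
    """Return the k archive entries most similar to the new log."""
    return sorted(archive, key=lambda past: jaccard(log, past), reverse=True)[:k]

archive = [
    "linker error in build stage: undefined symbol hana_init",
    "unit-tests stage failed: timeout in test_persistence",
    "deploy stage failed: missing credentials for artifact registry",
]
hits = top_k("unit-tests failed with timeout in test_persistence", archive, k=1)
```

Only the retrieved entries (and their resolved fixes) would then be placed in the prompt, keeping it well under the model's token limit regardless of archive size.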
Authors
- Duong Bui
- Stefan Grintz
- Alexander Berndt
- Thomas Bach
Paper Information
- arXiv ID: 2602.06709v1
- Categories: cs.SE
- Published: February 6, 2026