[Paper] Using Large Language Models to Support Automation of Failure Management in CI/CD Pipelines: A Case Study in SAP HANA
Source: arXiv - 2602.06709v1
Overview
The paper investigates whether large language models (LLMs) can reliably automate the detection and fixing of CI/CD pipeline failures in a real‑world, enterprise‑scale project—SAP HANA. By supplying the LLM with several kinds of domain knowledge, the authors show that it can pinpoint the failing component and suggest precise, actionable fixes far more reliably than a model given only the raw failure log.
Key Contributions
- LLM‑driven failure management prototype for a production‑grade CI/CD pipeline (SAP HANA).
- Systematic evaluation of three knowledge sources: pipeline metadata, explicit failure‑management instructions, and a repository of historical failure cases.
- Ablation study quantifying the impact of each knowledge source on location‑identification and solution‑generation accuracy.
- Empirical results demonstrating 97.4 % error‑location accuracy (vs. 84.2 % without domain knowledge) and 92.1 % exact‑solution rate when historical failure data are included.
- Practical guidelines for integrating LLMs into existing DevOps toolchains.
Methodology
- Data Collection – The authors extracted 1,200 CI/CD failure instances from SAP HANA’s build pipeline, each annotated with:
- The failing step (error location).
- A human‑written remediation instruction.
- Contextual metadata (e.g., affected module, test suite).
- Knowledge Injection – Three “knowledge packs” were prepared:
- Pipeline Info – Structured data about the CI/CD stages and artifact dependencies.
- Management Instructions – A curated set of rule‑based guidelines used by SAP engineers.
- Historical Failures – A searchable archive of past failure logs and their resolved solutions.
- LLM Prompt Engineering – A state‑of‑the‑art LLM (GPT‑4‑style) was prompted with the failure log plus one or more knowledge packs. The prompt asked the model to (a) locate the error and (b) output a minimal, executable fix.
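The prompt-assembly step can be pictured as simple concatenation of the failure log, the selected knowledge packs, and the two-part task. This is a minimal sketch under that assumption; the function name, section markers, and pack contents are illustrative, not taken from the paper.

```python
# Illustrative sketch of assembling a failure-management prompt from a
# failure log plus optional "knowledge packs" (all names hypothetical).

def build_prompt(failure_log: str, packs: dict[str, str]) -> str:
    """Concatenate the failure log with any provided knowledge packs."""
    sections = [f"### Failure log\n{failure_log}"]
    for name, content in packs.items():
        sections.append(f"### {name}\n{content}")
    sections.append(
        "### Task\n"
        "(a) Identify the failing pipeline stage.\n"
        "(b) Output a minimal, executable fix with no extra steps."
    )
    return "\n\n".join(sections)

prompt = build_prompt(
    "step 'unit-tests' exited with code 1: FAIL test_persistence",
    {"Pipeline Info": "stages: build -> unit-tests -> integration -> deploy"},
)
```

Leaving `packs` empty reproduces the baseline (no-knowledge) configuration, so the same function covers every ablation arm.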
- Ablation Experiments – The system was run under five configurations:
- No external knowledge (baseline).
- Only pipeline info.
- Only management instructions.
- Only historical failures.
- All three combined.
- Evaluation Metrics –
- Location Accuracy – Correct identification of the failing pipeline stage.
- Solution Exactness – Whether the suggested fix matches the human‑validated solution without superfluous steps.
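Both metrics reduce to exact-match rates over the annotated cases. The sketch below assumes a simple record layout with predicted and ground-truth fields; the field names are hypothetical, not the paper's own schema.

```python
# Hypothetical computation of the two evaluation metrics over annotated
# failure cases (the record structure is assumed, not from the paper).

def evaluate(cases: list[dict]) -> tuple[float, float]:
    """Return (location accuracy, exact-solution rate) as fractions."""
    loc_hits = sum(c["predicted_stage"] == c["true_stage"] for c in cases)
    fix_hits = sum(c["predicted_fix"] == c["true_fix"] for c in cases)
    n = len(cases)
    return loc_hits / n, fix_hits / n

cases = [
    {"predicted_stage": "build", "true_stage": "build",
     "predicted_fix": "rerun with -j1", "true_fix": "rerun with -j1"},
    {"predicted_stage": "test", "true_stage": "deploy",
     "predicted_fix": "clear cache", "true_fix": "bump dependency"},
]
loc_acc, sol_rate = evaluate(cases)  # → (0.5, 0.5)
```

Note that Solution Exactness is strict: a fix containing superfluous steps counts as a miss even if it would resolve the failure.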
Results & Findings
| Configuration | Error‑Location Accuracy | Exact‑Solution Rate |
|---|---|---|
| Baseline (no knowledge) | 84.2 % | 68.5 % |
| Pipeline info only | 89.1 % | 75.3 % |
| Management instructions only | 90.4 % | 78.9 % |
| Historical failures only | 97.4 % | 92.1 % |
| All knowledge packs combined | 96.8 % | 91.4 % |
- Historical failure data dominate: They provide concrete patterns that the LLM can match, dramatically boosting both location and solution accuracy.
- Marginal gains from combining all sources suggest diminishing returns once a rich failure archive is available.
- The LLM consistently generated minimal fixes—no extra steps, no “best‑practice” fluff—making the output ready for automated execution.
Practical Implications
- Automated Triage Bots: Teams can embed an LLM‑powered assistant into their CI/CD dashboards to instantly surface the root cause and a ready‑to‑run fix, cutting mean‑time‑to‑recovery (MTTR) by minutes or hours.
- Knowledge‑Base Leverage: Companies that already maintain a searchable log of past build failures can unlock immediate ROI by feeding that archive to an LLM, rather than building custom rule engines.
- Scalable DevOps: The approach scales with the size of the failure archive; as more incidents are logged, the model’s precision improves, creating a virtuous feedback loop.
- Integration Simplicity: Because the solution is generated as plain text commands or configuration snippets, it can be piped directly into existing automation tools (e.g., Jenkins, GitHub Actions) without extensive API work.
- Reduced On‑Call Fatigue: Junior engineers or on‑call staff can rely on the assistant for first‑line diagnostics, freeing senior staff for higher‑impact work.
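Because the fixes arrive as plain-text commands, a triage bot can gate them through a sandbox before anything touches the pipeline—a guardrail that also addresses the hallucination risk noted below. This is a minimal sketch; the dry-run mechanism and the example command are assumptions, not from the paper.

```python
# Illustrative guardrail: run an LLM-suggested fix in "sandbox" mode
# (here, merely echoing it) before executing it for real. The sandbox
# mechanism and example command are hypothetical.
import subprocess

def apply_fix(fix_cmd: str, sandbox: bool = True) -> bool:
    """Execute the suggested fix; in sandbox mode, only echo it."""
    cmd = ["echo", fix_cmd] if sandbox else ["sh", "-c", fix_cmd]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0

ok = apply_fix("git clean -fdx && ./build.sh --target unit-tests")
```

A production bot would replace the echo stand-in with a throwaway container or CI dry-run stage, and only promote the fix after the sandbox run succeeds.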
Limitations & Future Work
- Domain Specificity: The study focuses on SAP HANA; results may differ for pipelines with less structured logs or for languages/frameworks not represented in the historical archive.
- Model Hallucination Risk: Although exactness was high, occasional hallucinated commands could still cause failures; a verification step (e.g., sandbox execution) is advisable.
- Knowledge Maintenance: Keeping the historical failure repository up‑to‑date requires disciplined logging and curation—an operational overhead not covered in the paper.
- Scalability of Prompt Size: Very large knowledge packs may exceed token limits of current LLM APIs; future work could explore retrieval‑augmented generation or vector‑based similarity search to keep prompts concise.
- Broader Metrics: The authors measured accuracy but not downstream business impact (e.g., cost savings, developer satisfaction). Extending evaluation to these dimensions is a natural next step.
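The prompt-size limitation above points toward retrieving only the most relevant historical failures rather than inlining the whole archive. As a stand-in for the vector-based similarity search the authors propose as future work, this sketch uses Jaccard token overlap; the archive entries and function names are illustrative.

```python
# Minimal retrieval sketch: select the k archived failure cases most
# similar to a new failure log, using Jaccard token overlap as a simple
# stand-in for vector similarity search (all data here is made up).

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two log strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def top_k(log: str, archive: list[str], k: int = 3) -> list[str]:
    """Return the k archive entries most similar to the new log."""
    return sorted(archive, key=lambda past: jaccard(log, past), reverse=True)[:k]

archive = [
    "linker error in build stage: undefined symbol hana_init",
    "unit-tests stage failed: timeout in test_persistence",
    "deploy stage failed: missing credentials for artifact registry",
]
hits = top_k("unit-tests failed with timeout in test_persistence", archive, k=1)
```

Only the retrieved entries (and their resolved fixes) would then be placed in the prompt, keeping it well under the model's token limit regardless of archive size.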
Authors
- Duong Bui
- Stefan Grintz
- Alexander Berndt
- Thomas Bach
Paper Information
- arXiv ID: 2602.06709v1
- Categories: cs.SE
- Published: February 6, 2026