[Paper] Agentic Memory Enhanced Recursive Reasoning for Root Cause Localization in Microservices
Source: arXiv - 2601.02732v1
Overview
Microservice architectures now power many of today’s large‑scale applications, but their scale and dense inter‑service dependencies make failures hard to diagnose. This paper presents AMER‑RCL, a framework that combines recursive reasoning with an “agentic memory” to let large language models (LLMs) think more like seasoned Site Reliability Engineers (SREs). The authors show that the approach yields higher root‑cause localization accuracy while cutting inference latency.
Key Contributions
- Empirical SRE study – Interviews across several organizations uncovered three hallmarks of expert troubleshooting: recursive refinement, multi‑dimensional expansion, and cross‑modal reasoning.
- Recursive Reasoning Engine – A multi‑agent LLM system that iteratively narrows down candidate root causes for each alert, mimicking the step‑by‑step deduction SREs perform.
- Agentic Memory layer – A lightweight, time‑windowed store that captures reasoning traces from previously handled alerts and re‑uses them to avoid duplicated work.
- Comprehensive evaluation – Benchmarks on real‑world microservice failure datasets demonstrate consistent gains over prior graph‑based, deep‑learning, and LLM‑only baselines in accuracy (up to +12 % F1) and latency (‑30 % average inference time relative to the LLM‑only baseline).
- Open‑source prototype – The authors release a minimal implementation and a set of reproducible scripts, encouraging community adoption and further research.
Methodology
- Data collection & labeling – The team gathered alert logs, trace spans, and configuration snapshots from production microservice clusters, then had SREs annotate the true root causes.
- Agentic Memory design – A key‑value store indexed by alert signatures (e.g., service name, error pattern) retains the most recent reasoning steps (LLM prompts, intermediate hypotheses, and final verdict). The memory is refreshed every T minutes to keep context fresh.
- Recursive Reasoning loop (a minimal sketch follows this list):
  1. Initialize with the raw alert.
  2. Generate hypotheses using an LLM (e.g., GPT‑4) prompted to consider service dependencies, recent deployments, and known failure modes.
  3. Validate each hypothesis by querying observability data (metrics, logs) via tool‑specific adapters.
  4. Prune low‑confidence candidates and feed the survivors back into the LLM for the next recursion round.
  5. Terminate when confidence exceeds a threshold or the maximum recursion depth is reached.
- Cross‑alert reuse – Before processing a new alert, the system checks Agentic Memory for similar past alerts; if a match is found, it injects the prior reasoning trace into the prompt, letting the LLM “stand on the shoulders” of earlier work.
- Training & fine‑tuning – The LLM is kept frozen; only prompt templates and few‑shot examples are tuned on the annotated dataset to keep the system lightweight and portable.
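To make the loop concrete, here is a minimal Python sketch of how the time‑windowed memory and the hypothesize → validate → prune → recurse cycle could fit together. The signature fields (service name, error pattern), the T‑minute eviction window, the confidence threshold, and the recursion‑depth cap all come from the paper’s description; everything else, including `generate_hypotheses`, `validate`, and all default values, is an illustrative stand‑in, not the authors’ implementation.

```python
import time
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    trace: list        # prior reasoning steps (prompts, hypotheses, verdict)
    created_at: float  # insertion timestamp, used for window-based eviction


class AgenticMemory:
    """Time-windowed key-value store keyed by alert signature."""

    def __init__(self, window_minutes=30.0):  # "T minutes"; 30 is an assumed default
        self.window = window_minutes * 60
        self.store = {}

    def _evict_stale(self):
        now = time.time()
        self.store = {k: v for k, v in self.store.items()
                      if now - v.created_at <= self.window}

    def lookup(self, signature):
        self._evict_stale()
        entry = self.store.get(signature)
        return entry.trace if entry else None

    def save(self, signature, trace):
        self.store[signature] = MemoryEntry(trace, time.time())


def generate_hypotheses(alert, candidates, prior_trace):
    """Stub: in the paper this is an LLM call given the alert, surviving
    candidates, and any reasoning trace injected from memory."""
    return [{"cause": alert["service"] + "-upstream-dependency", "confidence": 0.6}]


def validate(hypothesis):
    """Stub: would query metrics/logs via tool-specific adapters and
    return an updated confidence score."""
    return min(1.0, hypothesis["confidence"] + 0.35)


def localize_root_cause(alert, memory, threshold=0.9, max_depth=5):
    signature = (alert["service"], alert["error_pattern"])
    prior = memory.lookup(signature)            # cross-alert reuse
    candidates = []
    trace = list(prior) if prior else []
    for depth in range(max_depth):
        candidates = generate_hypotheses(alert, candidates, prior)
        scored = [(h, validate(h)) for h in candidates]
        candidates = [h for h, c in scored if c >= 0.4]   # prune weak candidates
        trace.append({"depth": depth, "scored": scored})
        best_h, best_c = max(scored, key=lambda hc: hc[1], default=(None, 0.0))
        if best_c >= threshold:                 # confident enough: stop recursing
            memory.save(signature, trace)
            return best_h
    memory.save(signature, trace)
    return None


memory = AgenticMemory(window_minutes=30)
alert = {"service": "checkout", "error_pattern": "HTTP 503"}
print(localize_root_cause(alert, memory))
```

The design choice worth noting is that the memory is consulted and updated around the loop, not inside it: a hit seeds the prompt once, and the full trace is written back on termination, which is what lets repeated alerts skip redundant LLM calls.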
Results & Findings
| Metric | Graph‑Based Baseline | Deep‑Learning (GNN) | LLM‑Only | AMER‑RCL |
|---|---|---|---|---|
| F1‑Score (root cause) | 0.71 | 0.78 | 0.81 | 0.89 |
| Top‑3 Accuracy | 0.84 | 0.88 | 0.90 | 0.95 |
| Avg. Inference Latency (ms) | 420 | 350 | 610 | 430 |
| Redundant Reasoning (repeat prompts per alert) | – | – | 1.8× | 0.9× |
- Accuracy boost stems from the recursive refinement that eliminates spurious hypotheses early.
- Latency reduction is mainly due to Agentic Memory re‑using reasoning traces, cutting the number of LLM calls per alert by ~30 %.
- Ablation studies show that removing either the recursion or the memory component drops performance back to baseline levels, confirming their complementary roles.
Practical Implications
- Faster MTTR (Mean Time to Recovery) – By delivering more precise root‑cause suggestions quickly, SRE teams can remediate incidents with fewer manual investigations.
- Scalable observability pipelines – The memory layer works as a cheap cache; it can be integrated into existing alert‑routing tools (e.g., PagerDuty, Prometheus Alertmanager) without heavy compute overhead.
- Cross‑team knowledge sharing – The stored reasoning traces act as a living knowledge base, helping junior engineers learn from past incidents and reducing “tribal knowledge” loss.
- Vendor‑agnostic deployment – Since the LLM is accessed via API and the framework only needs adapters for metrics/logs, it can be dropped into any cloud‑native stack (Kubernetes, Service Meshes, etc.).
- Potential for automated remediation – With high‑confidence root causes, downstream automation (e.g., rollback, circuit‑breaker activation) can be safely triggered, moving from detection to self‑healing.
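As a hedged illustration of that last point, the sketch below gates an assumed remediation hook on the localization confidence. Both `trigger_rollback` and the 0.95 threshold are hypothetical choices for illustration, not part of the paper.

```python
ROLLBACK_THRESHOLD = 0.95  # assumed value; the paper does not specify one


def trigger_rollback(cause):
    """Hypothetical hook into deployment tooling (e.g., a Kubernetes rollback)."""
    print(f"rolling back the deployment implicated by: {cause}")


def maybe_remediate(verdict):
    """Act automatically only on very high-confidence root causes;
    otherwise keep a human in the loop."""
    if verdict is None or verdict.get("confidence", 0.0) < ROLLBACK_THRESHOLD:
        return "page-oncall"          # low confidence: escalate to an SRE
    trigger_rollback(verdict["cause"])  # high confidence: self-healing path
    return "auto-remediated"
```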
Limitations & Future Work
- Memory freshness trade‑off – The time window for Agentic Memory must balance relevance against storage cost; dynamic window sizing is left for future exploration.
- LLM dependency – The approach inherits the latency and cost characteristics of the underlying LLM service; offline fine‑tuning or distilled models could mitigate this.
- Generalization to non‑microservice domains – While the authors argue the methodology is transferable, validation on monolithic or edge‑computing environments remains open.
- Explainability – The recursive prompts generate intermediate hypotheses, but presenting them in a developer‑friendly UI is not covered. Future work could integrate visual reasoning traces.
Overall, AMER‑RCL bridges the gap between human‑like SRE reasoning and automated LLM inference, offering a practical path toward more reliable microservice operations.
Authors
- Lingzhe Zhang
- Tong Jia
- Yunpeng Zhai
- Leyi Pan
- Chiming Duan
- Minghua He
- Mengxi Jia
- Ying Li
Paper Information
- arXiv ID: 2601.02732v1
- Categories: cs.SE, cs.AI
- Published: January 6, 2026