[Paper] FC-ADL: Efficient Microservice Anomaly Detection and Localisation Through Functional Connectivity
Source: arXiv - 2512.00844v1
Overview
Microservice‑based systems are everywhere, from cloud‑native apps to massive e‑commerce platforms, but their distributed nature makes failures hard to detect and pinpoint quickly. The paper FC‑ADL: Efficient Microservice Anomaly Detection and Localisation Through Functional Connectivity introduces a low‑overhead technique that analyses the relationships between service metrics the way neuroscientists analyse functional connectivity between brain regions, enabling fast detection and root‑cause suggestion even in very large deployments.
Key Contributions
- Functional‑Connectivity‑Based Model: Adapts a neuroscience concept to capture time‑varying inter‑service metric dependencies without expensive causal inference.
- End‑to‑End Detection & Localisation Pipeline (FC‑ADL): Simultaneously flags anomalous behavior and produces a ranked list of likely faulty services.
- Scalability Demonstrated on Real‑World Scale: Tested on Alibaba’s massive microservice fabric (tens of thousands of services) with linear‑time performance.
- Empirical Superiority: Beats state‑of‑the‑art anomaly detectors and fault‑localisers across a broad set of synthetic and real fault scenarios.
- Open‑Source‑Ready Design: Uses only standard metric streams (CPU, latency, request counts) and lightweight graph‑based computations, making integration into existing observability stacks straightforward.
Methodology
- Metric Collection – Continuous streams of per‑service telemetry (e.g., latency percentiles, error rates) are ingested.
- Sliding‑Window Correlation – For each window, the algorithm computes pairwise Pearson correlations between all service metrics, forming a functional connectivity matrix that reflects how services move together over time (a minimal sketch of this and the change‑point step follows the list).
- Change‑Point Detection – The matrix is compared to a baseline (e.g., exponentially weighted moving average). Significant deviations trigger an anomaly flag.
- Root‑Cause Scoring – When an anomaly is detected, the method evaluates which nodes (services) contributed most to the matrix change using a simple influence score derived from the magnitude of correlation shifts; see the ranking sketch at the end of this section.
- Ranking & Alerting – Services are ranked by influence score; the top‑k are presented to operators as root‑cause candidates.
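As a concrete illustration of the sliding‑window and change‑point steps, the sketch below builds the per‑window connectivity matrix and compares it to an exponentially weighted baseline. It is a minimal reconstruction from the description above, not the authors' implementation; the decay factor `alpha`, the deviation threshold, and the warm‑up handling are illustrative assumptions.

```python
# Minimal sketch of the windowed functional-connectivity and change-point steps.
# Not the authors' code: alpha, the threshold, and the warm-up handling are assumptions.
import numpy as np


def fc_matrix(window: np.ndarray) -> np.ndarray:
    """Pairwise Pearson correlations for a (samples x services) metric window."""
    fc = np.corrcoef(window, rowvar=False)
    return np.nan_to_num(fc)  # constant series produce NaN correlations; treat as 0


def detect(windows, alpha: float = 0.2, threshold: float = 3.0):
    """Yield (window_index, is_anomalous, delta) for a stream of metric windows.

    delta is the element-wise shift between the current connectivity matrix and
    the EWMA baseline; it feeds the root-cause scoring step.
    """
    baseline, dev_scale = None, None
    for t, window in enumerate(windows):
        fc = fc_matrix(window)
        if baseline is None:                      # first window initialises the baseline
            baseline = fc
            yield t, False, np.zeros_like(fc)
            continue
        delta = fc - baseline
        score = float(np.linalg.norm(delta))      # magnitude of the connectivity shift
        if dev_scale is None:
            dev_scale = score + 1e-9              # warm-up: first shift sets the scale
            is_anomalous = False
        else:
            is_anomalous = score > threshold * dev_scale
        if not is_anomalous:                      # update baselines on normal windows only
            baseline = (1 - alpha) * baseline + alpha * fc
            dev_scale = (1 - alpha) * dev_scale + alpha * score
        yield t, is_anomalous, delta
```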
All steps rely on linear‑time operations (matrix updates are incremental) and avoid combinatorial causal searches, keeping CPU and memory footprints low enough for production use.
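The paper describes the influence score only as being derived from the magnitude of correlation shifts, so the ranking sketch below uses one plausible instantiation: each service is scored by the total absolute shift along its row of the delta matrix, and the top‑k services are surfaced as candidates. FC‑ADL's exact formula may differ.

```python
# One plausible influence score (sum of absolute per-row correlation shifts);
# the paper's exact formula may differ.
import numpy as np


def influence_scores(delta: np.ndarray) -> np.ndarray:
    """Per-service influence: total absolute correlation shift involving that service."""
    return np.abs(delta).sum(axis=1)


def top_k_candidates(delta: np.ndarray, service_names, k: int = 3):
    """Rank services by influence score and return the k most suspicious."""
    scores = influence_scores(delta)
    order = np.argsort(scores)[::-1][:k]
    return [(service_names[i], float(scores[i])) for i in order]


# Tying the pipeline together (windows and service_names are hypothetical inputs):
# for t, anomalous, delta in detect(windows):
#     if anomalous:
#         print(f"window {t}: suspects = {top_k_candidates(delta, service_names)}")
```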
Results & Findings
| Evaluation | Metric | FC‑ADL | Best Prior Art |
|---|---|---|---|
| Synthetic fault injection (10‑100 services) | Detection F1‑score | 0.93 | 0.78 |
| Real‑world Alibaba trace (≈ 30 k services) | Localization Top‑3 accuracy | 0.87 | 0.61 |
| Throughput impact | CPU overhead per 1 k services | < 2 % | 5‑12 % |
| Latency to raise an alert | Median detection latency | ≈ 30 s | 120 s |
Key Takeaways
- The functional‑connectivity signal captures subtle, system‑wide drifts that single‑metric thresholds miss.
- Localization quality remains high even when multiple services are simultaneously degraded.
- The approach scales linearly; adding more services does not explode computation time.
Practical Implications
- Plug‑and‑Play Anomaly Service – Teams can drop FC‑ADL into existing Prometheus/Grafana pipelines, leveraging already‑collected metrics (a hedged integration sketch follows this list).
- Faster MTTR – A ranked list of suspect services, delivered within seconds, lets on‑call engineers triage incidents more efficiently and reduces mean time to resolution.
- Cost‑Effective Observability – No need for heavyweight tracing or distributed causal inference engines, which often require additional instrumentation and storage.
- Proactive Capacity Planning – Continuous functional‑connectivity maps can reveal emerging coupling patterns, helping architects refactor overly tight service dependencies before they cause outages.
- Vendor‑Neutral – Works with any cloud provider or orchestration platform (Kubernetes, Nomad, etc.) as long as metric streams are available.
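For the Prometheus/Grafana case above, ingestion can be as simple as polling the range‑query HTTP API and stacking the per‑service series into the windows the detector expects. The sketch below is a hedged illustration: the endpoint URL, the PromQL expression, and the `service` label are placeholders, not part of FC‑ADL.

```python
# Hedged integration sketch: pull per-service series from Prometheus' range API.
# The URL, PromQL query, and the "service" label are illustrative placeholders.
import time
import numpy as np
import requests

PROM_URL = "http://prometheus.example:9090"  # hypothetical Prometheus endpoint


def fetch_series(query: str, minutes: int = 10, step: str = "15s") -> dict:
    """Return {service_label: np.ndarray of samples} for a PromQL range query."""
    end = time.time()
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": query, "start": end - 60 * minutes, "end": end, "step": step},
        timeout=10,
    )
    resp.raise_for_status()
    series = {}
    for result in resp.json()["data"]["result"]:
        label = result["metric"].get("service", "unknown")  # label name is an assumption
        series[label] = np.array([float(v) for _, v in result["values"]])
    return series


# Example: mean request latency per service (PromQL shown is illustrative only).
# series = fetch_series(
#     "rate(http_request_duration_seconds_sum[1m])"
#     " / rate(http_request_duration_seconds_count[1m])"
# )
# window = np.column_stack([series[s] for s in sorted(series)])  # samples x services
```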
Limitations & Future Work
- Metric Diversity – The current implementation focuses on scalar performance metrics; richer logs or traces are not directly incorporated.
- Assumption of Linear Correlation – Pearson correlation may miss non‑linear relationships; future extensions could explore mutual information or kernel‑based measures (a mutual‑information variant is sketched at the end of this section).
- Cold‑Start Baseline – Accurate baselines need a stable observation period; highly volatile workloads may require adaptive baseline strategies.
- Root‑Cause Granularity – While FC‑ADL surfaces candidate services, pinpointing the exact code path still needs complementary debugging tools.
The authors suggest exploring hybrid models that fuse functional connectivity with lightweight causal graphs, and extending the framework to handle multi‑tenant environments where metric isolation is a concern.
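To make the suggested non‑linear extension concrete, the connectivity step could swap Pearson correlation for pairwise mutual information, for example with scikit‑learn's estimator as sketched below. Estimating a full pairwise MI matrix is far more expensive than correlation, so this is an illustration of the direction rather than a drop‑in replacement that preserves FC‑ADL's low overhead.

```python
# Sketch of a mutual-information connectivity matrix as a non-linear alternative
# to Pearson correlation; cost grows quadratically with the number of services.
import numpy as np
from sklearn.feature_selection import mutual_info_regression


def mi_connectivity(window: np.ndarray, random_state: int = 0) -> np.ndarray:
    """Pairwise mutual-information matrix for a (samples x services) window."""
    n_services = window.shape[1]
    mi = np.zeros((n_services, n_services))
    for j in range(n_services):
        # MI between service j's series and every service's series (including itself).
        mi[:, j] = mutual_info_regression(window, window[:, j], random_state=random_state)
    return (mi + mi.T) / 2.0  # symmetrise; the k-NN estimator is not exactly symmetric
```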
Authors
- Giles Winchester
- George Parisis
- Luc Berthouze
Paper Information
- arXiv ID: 2512.00844v1
- Categories: cs.SE, cs.DC, cs.LG
- Published: November 30, 2025