[Paper] EES-CND: Collaborative Neural Decision-Making for Drift-Aware Fault-Tolerant Edge-Cloud Service Placement
Source: arXiv - 2606.02259v1
Overview
The paper presents EES‑CND, a fault‑tolerant service‑placement framework for edge‑cloud systems. Several small neural networks collaborate and co‑evolve so that services can be re‑allocated quickly when hardware or software fails, while performance drift is kept under control.
Key Contributions
- Collaborative Neural Decision‑Making (CND): Introduces a lightweight ensemble of neural nets that jointly decide where to redeploy services during a failure, with lower decision latency than a single heavyweight model.
- Enhanced Evolution Strategy (EES): An online, drift‑aware evolutionary optimizer that continuously fine‑tunes the neural ensemble to reflect changing workloads and resource conditions.
- Drift‑Aware Fault Tolerance: Combines CND with EES to detect and compensate for performance drift caused by dynamic edge‑cloud environments, improving reliability without excessive re‑training costs.
- Comprehensive Evaluation: Shows up to 44.8 % lower fault‑tolerance cost and significant gains in recovery time, response time, and overall reliability compared with state‑of‑the‑art placement algorithms.
Methodology
- System Model: The edge‑cloud infrastructure is abstracted as a set of heterogeneous nodes (edge devices, micro‑data‑centers, central cloud) with varying compute, storage, and network characteristics.
- Neural Ensemble: Instead of a monolithic predictor, the authors deploy N lightweight feed‑forward networks (each trained on a different slice of historical placement data). During a failure, the networks vote on the best new placement, and a simple aggregation rule produces the final decision.
- Enhanced Evolution Strategy:
  - Population: Each neural network’s weight vector is treated as an individual in an evolutionary population.
  - Mutation & Crossover: Standard ES operators are applied, but with an adaptive step‑size that grows when performance drift is detected (e.g., sudden latency spikes) and shrinks when the system stabilizes.
  - Online Update: The ES runs continuously in the background, using real‑time metrics (CPU load, network latency, SLA violations) as fitness signals, so the ensemble stays aligned with the current operating conditions.
- Fault‑Aware Placement Loop: When a node fails, the orchestrator triggers the CND ensemble, obtains a redeployment plan, and instantly migrates the affected services. The ES then refines the ensemble based on the outcome, closing the feedback loop.
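The voting and drift‑aware evolution steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the network sizes, the majority‑vote rule, the latency‑ratio drift test, the step‑size factors, and `toy_fitness` are all hypothetical stand‑ins, and crossover is omitted for brevity.

```python
import math
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical sizes (not from the paper): 4 input features, 6 hidden units, 3 nodes.
N_FEATURES, N_HIDDEN, N_NODES = 4, 6, 3
N_WEIGHTS = N_FEATURES * N_HIDDEN + N_HIDDEN * N_NODES


def forward(w, x):
    """Score each candidate node with a tiny one-hidden-layer network (no biases)."""
    hidden = [
        math.tanh(sum(w[h * N_FEATURES + i] * x[i] for i in range(N_FEATURES)))
        for h in range(N_HIDDEN)
    ]
    offset = N_FEATURES * N_HIDDEN
    return [
        sum(w[offset + k * N_HIDDEN + h] * hidden[h] for h in range(N_HIDDEN))
        for k in range(N_NODES)
    ]


def vote(population, x, alive):
    """Each network votes for its best surviving node; the majority wins (assumed rule)."""
    ballots = []
    for w in population:
        scores = forward(w, x)
        ballots.append(max(alive, key=lambda k: scores[k]))
    return Counter(ballots).most_common(1)[0][0]


def drift_detected(latencies, window=5, ratio=1.5):
    """Assumed drift test: recent mean latency exceeds the previous window's mean by `ratio`."""
    if len(latencies) < 2 * window:
        return False
    recent = sum(latencies[-window:]) / window
    older = sum(latencies[-2 * window:-window]) / window
    return recent > ratio * older


def adapt_sigma(sigma, drift, grow=1.5, shrink=0.9, lo=0.01, hi=1.0):
    """Grow the mutation step when drift is detected, shrink it otherwise."""
    factor = grow if drift else shrink
    return min(hi, max(lo, sigma * factor))


def es_step(population, fitness, sigma):
    """One elitist generation: mutate every parent, keep the fittest individuals."""
    children = [[wi + random.gauss(0.0, sigma) for wi in w] for w in population]
    pool = population + children
    pool.sort(key=fitness, reverse=True)
    return pool[: len(population)]


def toy_fitness(w):
    """Stand-in for real-time metrics: count how often the least-loaded node is picked."""
    samples = [[0.9, 0.2, 0.5, 1.0], [0.1, 0.8, 0.6, 1.0], [0.7, 0.6, 0.1, 1.0]]
    hits = 0
    for x in samples:
        scores = forward(w, x)
        hits += int(scores.index(max(scores)) == x.index(min(x[:N_NODES])))
    return hits


population = [[random.gauss(0.0, 0.5) for _ in range(N_WEIGHTS)] for _ in range(5)]
latencies = [80, 82, 79, 81, 80, 130, 140, 135, 150, 145]  # made-up latency spike (ms)

drift = drift_detected(latencies)
sigma = adapt_sigma(0.1, drift)                       # step size grows after the spike
population = es_step(population, toy_fitness, sigma)  # background refinement
target = vote(population, [0.9, 0.2, 0.5, 1.0], alive=[1, 2])  # node 0 has failed
print(f"drift={drift}, sigma={sigma:.3f}, redeploy to node {target}")
```

In a real deployment the fitness signal would come from measured CPU load, network latency, and SLA violations rather than a fixed sample set, and the vote would cover a full placement plan rather than a single target node.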
Results & Findings
| Metric | EES‑CND vs. Baseline (stand‑alone NN) | Improvement |
|---|---|---|
| Service Recovery Time | 1.2 s vs. 2.1 s | ≈ 43 % faster |
| Average Response Time (post‑recovery) | 85 ms vs. 112 ms | ≈ 24 % lower |
| Reliability (SLA breach rate) | 0.8 % vs. 2.3 % | ≈ 65 % reduction |
| Fault‑Tolerance Cost (resource + migration overhead) | ≈ 0.55 × baseline | Up to 44.8 % cheaper |
The experiments, conducted on a simulated edge‑cloud testbed with realistic workload traces (IoT analytics, video streaming, AR gaming), demonstrate that the collaborative ensemble reacts faster to failures and adapts more gracefully to workload shifts than a single static model.
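The percentages in the Improvement column follow from the paired values as a relative reduction against the baseline. A short check, assuming the values exactly as listed in the table (the helper name `relative_reduction` is only illustrative):

```python
def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100.0


recovery = relative_reduction(2.1, 1.2)   # service recovery time, seconds
response = relative_reduction(112, 85)    # post-recovery response time, ms
breaches = relative_reduction(2.3, 0.8)   # SLA breach rate, percent
cost_ratio = 1 - 44.8 / 100               # cost ratio implied by a 44.8 % reduction

print(f"recovery time: {recovery:.1f} % lower")
print(f"response time: {response:.1f} % lower")
print(f"SLA breaches:  {breaches:.1f} % lower")
print(f"cost ratio:    {cost_ratio:.3f} x baseline")
```

Note that a 44.8 % reduction implies a cost ratio of about 0.552, so any ratio quoted alongside it is a rounded figure.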
Practical Implications
- Edge‑First Applications: Developers of latency‑sensitive services (AR/VR, autonomous drones, real‑time analytics) can rely on EES‑CND to keep their functions running even when edge nodes go offline, without sacrificing QoS.
- Reduced Ops Overhead: Because the neural ensemble is lightweight and evolves online, operators avoid costly offline retraining cycles and can deploy the solution on resource‑constrained edge gateways.
- SLA‑Driven Autoscaling: Cloud‑native platforms (Kubernetes, OpenYurt) can integrate the CND decision engine as a custom scheduler plugin, automatically triggering migrations that respect SLA budgets.
- Cost Savings: The reduction of up to 44.8 % in fault‑tolerance cost (resource plus migration overhead) suggests lower bandwidth usage for state transfer and fewer extra compute instances, which is attractive for telco edge deployments and multi‑tenant edge clouds.
Limitations & Future Work
- Simulation‑Based Validation: The study relies on synthetic failure patterns and simulated network conditions; real‑world deployments could expose additional latency sources (e.g., container start‑up time).
- Scalability of Ensemble Size: While the authors show benefits with a modest number of neural nets, the trade‑off between ensemble size and decision latency in massive edge clusters remains open.
- Security Considerations: The framework assumes trustworthy nodes; future work could explore how to harden the collaborative decision process against adversarial attacks or compromised edge devices.
- Integration with Existing Orchestrators: The paper outlines a custom placement loop; extending the approach to standard orchestration APIs (K8s scheduler extensions, OpenStack) is a natural next step.
EES‑CND combines lightweight neural decision‑making with online evolutionary adaptation, a promising direction for edge‑cloud services that must stay resilient under failures and shifting workloads.
Authors
- Mohammadsadeq Garshasbi Herabad
- Javid Taheri
- Bestoun S. Ahmed
- Calin Curescu
Paper Information
- arXiv ID: 2606.02259v1
- Categories: cs.DC
- Published: June 1, 2026