[Paper] Emergence-as-Code for Self-Governing Reliable Systems
Source: arXiv - 2602.05458v1
Overview
The paper introduces Emergence-as-Code (EmaC), a new paradigm for turning the reliability objective of an end-to-end user journey (for example, "checkout p99 < 400 ms") into a declarative, version-controlled artifact. By linking high-level journey intent to low-level Service-Level Objectives (SLOs) and live telemetry, EmaC makes reliability a computable, reviewable piece of code rather than an ad-hoc spreadsheet.
Key Contributions
- Journey-level reliability spec: A concise, Git-trackable language that captures the desired user-experience objective, control-flow operators (e.g., retries, fallbacks), and permissible actions; a hypothetical sketch of such a spec follows this list.
- Inference engine: Runtime component that consumes tracing data, traffic routing rules, and configuration to synthesize a candidate journey model with provenance and confidence scores.
- Compiler/controller pipeline: Transforms the accepted model into bounded journey‑SLOs and budget allocations under explicit correlation assumptions (optimistic independence vs. pessimistic shared‑fate).
- Control‑plane artifacts: Automatically generates burn‑rate alerts, rollout gates, and action guards that can be reviewed and merged via standard Git workflows.
- Artifact repository: An anonymized, runnable example that demonstrates the full spec‑to‑artifact lifecycle, enabling reproducibility and community experimentation.
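To make the spec idea concrete, the following is a minimal, hypothetical sketch (in Python, using dataclasses) of the kind of information a journey-level reliability spec would carry. The field names and structure are illustrative assumptions for this summary, not the paper's actual spec syntax.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical encoding of a journey-level reliability spec.
# Field names and structure are illustrative; the paper's actual
# spec language is not reproduced here.

@dataclass
class Step:
    service: str                     # microservice handling this hop
    retries: int = 0                 # bounded retry budget for the hop
    fallback: Optional[str] = None   # optional fallback service, if permitted

@dataclass
class JourneySpec:
    name: str                        # e.g. "checkout"
    objective: str                   # user-experience objective
    flow: List[Step] = field(default_factory=list)              # ordered calls
    forbidden_actions: List[str] = field(default_factory=list)  # action guards

checkout = JourneySpec(
    name="checkout",
    objective="p99_latency_ms < 400",
    flow=[
        Step("cart", retries=1),
        Step("pricing", retries=1),
        Step("payment", retries=0),  # no fallback permitted for payment
    ],
    forbidden_actions=["external_payment_gateway_fallback"],
)
```

The paper describes the real artifact as a concise, Git-trackable text format; the dataclass form above simply makes the shape of the captured information explicit.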
Methodology
- Intent Declaration – Engineers write an EmaC spec that states the journey goal (e.g., “checkout latency p99 < 400 ms”), the logical flow (sequence of microservice calls, retries, circuit‑breakers), and any constraints on actions (e.g., “no external payment gateway fallback”).
- Telemetry Ingestion – The runtime inference service continuously pulls distributed tracing spans, service-mesh routing tables, and SLO metrics from the observability stack (e.g., OpenTelemetry traces, Prometheus metrics).
- Model Synthesis – Using the collected artifacts, the engine builds a probabilistic graph of the journey, annotating each edge with latency distributions, failure probabilities, and correlation tags. It also attaches a confidence level based on data freshness and coverage.
- Verification & Acceptance – The generated model is presented to developers for review. Once approved (via a pull request), it becomes the source of truth for the next steps.
- Compilation – The EmaC compiler applies user-specified correlation assumptions to compute worst-case latency budgets and error-budget allocations for each hop, producing concrete SLOs (e.g., "service-A latency ≤ 120 ms"); a hypothetical allocation is sketched just after this list.
- Control-Plane Emission – The controller emits configuration for alerting (burn-rate thresholds), CI/CD gates (preventing rollouts that would breach budgets), and runtime guards (circuit-breaker policies); a burn-rate rule emission is also sketched below. All artifacts are stored as code, enabling auditability and rollbacks.
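The paper's exact compilation rules are not reproduced in this summary, so the following Python sketch is only a plausible stand-in for how a compiler might split a journey p99 latency target into per-hop budgets under the two correlation assumptions. The mean-weighted split, the quadrature rule for the independent case, and the hop names are all illustrative assumptions.

```python
import math
from typing import Dict

def allocate_latency_budgets(journey_p99_ms: float,
                             hop_mean_ms: Dict[str, float],
                             assumption: str = "shared_fate") -> Dict[str, float]:
    """Split a journey p99 latency target into per-hop p99 budgets."""
    total_mean = sum(hop_mean_ms.values())
    slack = journey_p99_ms - total_mean          # tail headroom above the means
    if slack <= 0:
        raise ValueError("journey target is below the sum of hop means")
    # Weight each hop's share of the tail headroom by its mean latency.
    weights = {h: m / total_mean for h, m in hop_mean_ms.items()}
    if assumption == "shared_fate":
        # Pessimistic: assume hop tails coincide, so per-hop slacks must add up
        # linearly to the journey slack (budgets sum to the journey target).
        scale = 1.0
    elif assumption == "independent":
        # Optimistic: assume hop tails rarely coincide, so slacks combine
        # roughly in quadrature and each hop can be given more headroom.
        scale = 1.0 / math.sqrt(sum(w * w for w in weights.values()))
    else:
        raise ValueError(f"unknown correlation assumption: {assumption}")
    return {h: hop_mean_ms[h] + slack * weights[h] * scale
            for h in hop_mean_ms}

means_ms = {"cart": 40.0, "pricing": 60.0, "payment": 100.0}
print(allocate_latency_budgets(400.0, means_ms, "shared_fate"))
print(allocate_latency_budgets(400.0, means_ms, "independent"))
```

Under the pessimistic shared-fate rule the per-hop budgets sum exactly to the 400 ms journey target; under optimistic independence each hop receives more headroom because the tails are assumed not to coincide on the same request.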
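As an example of the control-plane artifacts emitted in the last step, the sketch below builds a burn-rate alert rule from a compiled journey SLO. The rule layout mirrors a Prometheus alerting rule; the metric names and the 14.4x fast-burn threshold over a 1h window are common SRE conventions used here as assumptions, not values taken from the paper.

```python
import json

# Illustrative emission of a burn-rate alert rule for a compiled journey SLO.
# Metric names (journey_errors_total, journey_requests_total) are hypothetical.

def emit_burn_rate_alert(journey: str, slo_target: float,
                         burn_rate: float = 14.4, window: str = "1h") -> dict:
    error_budget = 1.0 - slo_target
    # Alert when the observed error ratio exceeds burn_rate times the budgeted rate.
    expr = (
        f'(sum(rate(journey_errors_total{{journey="{journey}"}}[{window}]))'
        f' / sum(rate(journey_requests_total{{journey="{journey}"}}[{window}])))'
        f" > {burn_rate * error_budget:.6f}"
    )
    return {
        "alert": f"{journey}_fast_burn",
        "expr": expr,
        "for": "2m",
        "labels": {"severity": "page", "journey": journey},
        "annotations": {
            "summary": f"{journey} is burning its error budget at >{burn_rate}x",
        },
    }

print(json.dumps(emit_burn_rate_alert("checkout", 0.999), indent=2))
```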
Results & Findings
- Accuracy – In a production‑grade microservice demo (≈ 30 services, 5 k RPS), the inferred journey model predicted p99 latency within ±8 % of observed values after a warm‑up period of 10 minutes.
- Budget Tightening – By exposing hidden tail‑amplification effects, teams were able to reduce over‑provisioned error budgets by ≈ 22 % without violating user‑experience goals.
- Release Safety – Automated rollout gates based on the generated burn‑rate alerts caught 3 out of 4 simulated failure injections that would have otherwise breached the checkout latency SLO.
- Developer Velocity – The Git‑centric workflow reduced the mean time to update a journey SLO from 2 weeks (manual spreadsheet process) to under 1 day.
Practical Implications
- Unified Reliability Ownership – Product teams can now own the end‑to‑end experience in the same repo where they store code, eliminating the “SLO‑to‑journey” translation gap.
- Safer Continuous Delivery – CI pipelines can automatically gate releases on real-time error-budget consumption, lowering the risk of regressions that only surface under load; a minimal gate check is sketched after this list.
- Cost Optimization – Explicit correlation modeling helps identify when services share failure domains, allowing smarter redundancy strategies and avoiding unnecessary over‑provisioning.
- Observability‑as‑Code – By treating tracing and telemetry as inputs to a compiler, organizations can enforce consistent observability standards across services.
- Regulatory & SLA Audits – All reliability decisions are codified and versioned, simplifying compliance reporting and SLA negotiations with customers.
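As a sketch of how such a CI gate could look, the following check blocks a rollout when the journey's remaining error budget for the current SLO window falls below a threshold. The 25% threshold and the example traffic numbers are assumptions for illustration, not the paper's gate implementation.

```python
import sys

# Illustrative CI gate: block a rollout when the journey's remaining error
# budget for the current SLO window is below a safety threshold.

def remaining_error_budget(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the window's error budget still unspent (can go negative)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures

def gate_release(slo_target: float, total: int, failed: int,
                 min_remaining: float = 0.25) -> bool:
    """Return True if the rollout may proceed."""
    return remaining_error_budget(slo_target, total, failed) >= min_remaining

if __name__ == "__main__":
    # Example numbers: 10M checkout requests this window, 7,000 failures,
    # against a 99.9% journey SLO (budget of 10,000 failures).
    ok = gate_release(slo_target=0.999, total=10_000_000, failed=7_000)
    print("remaining budget:",
          remaining_error_budget(0.999, 10_000_000, 7_000))  # ~0.30
    sys.exit(0 if ok else 1)  # non-zero exit fails the CI pipeline step
```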
Limitations & Future Work
- Data Freshness Dependency – The inference accuracy hinges on low‑latency, high‑coverage tracing; sparse instrumentation can degrade confidence scores.
- Correlation Assumption Complexity – Choosing between optimistic independence and pessimistic shared‑fate models requires domain expertise; mis‑selection can lead to either over‑conservative or unsafe budgets.
- Scalability of Model Synthesis – While the prototype handled tens of services, scaling to hundreds of microservices with dynamic topologies may demand more efficient graph algorithms or sampling techniques.
- Tooling Integration – The current implementation is a standalone prototype; tighter integration with popular service meshes (Istio, Linkerd) and CI/CD platforms is planned.
- User‑Study Validation – Future work includes longitudinal studies with engineering teams to quantify the impact on reliability culture and incident reduction.
Authors
- Anatoly A. Krasnovsky
Paper Information
- arXiv ID: 2602.05458v1
- Categories: cs.SE, cs.DC, cs.PF, eess.SY
- Published: February 5, 2026