Shift-Left Reliability

Published: January 12, 2026 at 04:34 PM EST
5 min read
Source: Dev.to

The paradox

We’ve become exceptionally good at incident response. Modern teams restore service quickly, run thoughtful postmortems, and hold themselves accountable through corrective actions.

And yet…

A team ships a change that passes every test, gets all the required approvals, and still brings down checkout for 47 minutes. The postmortem conclusion? “We should have known our latency SLO was already at 94 % before deploying.”

Many postmortems point to the same root cause: changes we introduced ourselves. Not hardware failures. Not random outages. Just software behaving exactly as we told it to.

We continue to treat reliability as something to evaluate after those changes are already live. This isn’t a failure of tooling or process. It’s a question of when we decide whether a system is ready.

Where reliability decisions actually happen today

I’ve seen multiple teams running identical technology stacks with completely different SLOs, metrics, and alerts. Nobody told them what to implement, what best practice looks like, or how to tune their alerts. They want to be good reliability citizens, but getting from the theory in the handbook to working practice is not straightforward.

  • Services regularly move into production with SLOs being created months later—or never.
  • Dashboards are missing, insufficient, or inconsistent.
  • “Looks fine to me” during PR reviews.
  • Tribal knowledge and varying levels of understanding across teams.

Reliability is fundamentally bespoke and ungoverned. That’s the core issue.

The missing layer

GitHub gave us version control for code. Terraform gave us version control for infrastructure. Security has transformed with shift‑left—finding flaws as code is written, not after deployment.

We’re still missing version control for reliability.

We need a specification that:

  1. Defines requirements.
  2. Validates them against reality.
  3. Generates the artifacts: dashboards, SLOs, alerts, escalation policies.

If the specification is validated and the artifacts created, the same tool can check in real‑time whether a service is in breach—and block high‑risk deployments in CI/CD.
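As a rough sketch of the "generate" step, a tool could derive an SLO and an alert from nothing more than the declared tier. The tier-to-target mapping and field names below are hypothetical illustrations, not NthLayer's actual logic:

```python
# Illustrative sketch only: the tier-to-target mapping and artifact shapes
# below are hypothetical, not NthLayer's real generation logic.
from dataclasses import dataclass

# Hypothetical mapping from service tier to an availability SLO target.
TIER_TARGETS = {"critical": 99.95, "standard": 99.9, "best-effort": 99.0}

@dataclass
class ServiceSpec:
    name: str
    tier: str

def generate_artifacts(spec: ServiceSpec) -> dict:
    """Derive an SLO and a burn-rate alert from the declared intent."""
    target = TIER_TARGETS[spec.tier]
    return {
        "slo": {"service": spec.name, "objective": target, "window": "30d"},
        "alert": {
            "name": f"{spec.name}-error-budget-burn",
            # Alert when the error budget burns 14.4x faster than sustainable
            # (a common fast-burn threshold in SRE practice).
            "expr": f"error_budget_burn_rate{{service='{spec.name}'}} > 14.4",
        },
    }

artifacts = generate_artifacts(ServiceSpec(name="payment-api", tier="critical"))
```

The point is determinism: two teams declaring the same tier get the same artifacts.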

What shift‑left reliability actually means

Shift‑left reliability doesn’t mean more alerts, more dashboards, more postmortems, or more people in the room. It means:

  • Spec – Define reliability requirements as code before production deployment.
  • Validate – Test those requirements against reality.
  • Enforce – Gate deployments through CI/CD.

Engineers don’t write PromQL or Grafana JSON—they declare intent, and reliability becomes deterministic. Outcomes are predictable, consistent, transparent, and follow best practice.
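The "enforce" step reduces to a deterministic comparison of declared intent against measured reality. A minimal sketch, with function and field names of my own invention rather than NthLayer's API:

```python
# Minimal sketch of a deployment gate: compare declared targets with measured
# values and return violations. Names here are illustrative, not NthLayer's API.

def check_deploy(declared: dict, measured: dict) -> list[str]:
    """Return a list of violations; an empty list means the deploy may proceed."""
    violations = []
    if measured["availability"] < declared["availability_target"]:
        violations.append(
            f"availability SLO at {measured['availability']}% "
            f"(target: {declared['availability_target']}%)"
        )
    if measured["error_budget_minutes"] <= 0:
        violations.append(
            f"error budget exhausted: {measured['error_budget_minutes']} minutes remaining"
        )
    return violations

violations = check_deploy(
    declared={"availability_target": 99.95},
    measured={"availability": 99.2, "error_budget_minutes": -47},
)
# Two violations -> the gate blocks the deployment.
```

Because the inputs are explicit, the decision is reproducible: the same spec and the same telemetry always produce the same verdict.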

An executable reliability contract

Keep it simple. A team creates a service.yaml file with their reliability intent:

name: payment-api
tier: critical
type: api
team: payments
dependencies:
  - postgresql
  - redis

A complete service.yaml example can be found here.

Tooling validates metrics, SLOs, and error budgets, then generates these artifacts automatically. This is the approach I am exploring with an open‑source project called NthLayer.

NthLayer runs in any CI/CD pipeline—GitHub Actions, ArgoCD, Jenkins, Tekton, GitLab CI. The goal isn’t to be an inflexible blocker; it’s to make risk visible and decisions explicit. Overrides are fine when they’re intentional, logged, and owned.
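As a sketch of what the CI integration might look like, a GitHub Actions job could run the check and let a non-zero exit code fail the pipeline. The CLI invocation is taken from this post's examples; the job layout, action versions, and step names are my assumptions, not NthLayer's documented setup:

```yaml
# Hypothetical GitHub Actions job; only the nthlayer CLI usage comes from
# this post — everything else is an assumed sketch.
deploy-gate:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - run: pip install nthlayer
    # A non-zero exit code (e.g. 2 = BLOCKED) fails the job and stops the deploy.
    - run: nthlayer check-deploy --service payment-api
```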

When a deployment is attempted, the specification is evaluated against reality:

$ nthlayer check-deploy --service payment-api
ERROR: Deployment blocked
 - availability SLO at 99.2 % (target: 99.95 %)
 - error budget exhausted: -47 minutes remaining
 - 3 P1 incidents in last 7 days

exit code: 2 (BLOCKED)

Why now?

SLOs have had 8+ years to mature and move from the Google SRE Handbook into mainstream practice. GitOps has normalized declarative configuration. Platform Engineering has matured as a discipline. The concepts are ready, but the tooling has lagged behind.

This is a deliberate shift in approach. Reliability is no longer up for debate during incidents. Services have defined owners with deterministic standards. We can stop reinventing the reliability wheel every time a new service is onboarded. If requirements change, update the service.yaml, run NthLayer, and every service benefits from the new standard.

What this does not replace

NthLayer doesn’t replace service catalogs, developer portals, observability platforms, or incident management. It doesn’t predict failures or eliminate human judgment. It sits upstream of all these systems.

Goal: a reliability specification, automated deployment gates, and reduced cognitive load for implementing best practices.

Open questions

I don’t have all the answers, but two questions I keep returning to are:

  1. Contract Drift: What happens when the spec says 99.95 % but reality has been 99.5 % for months? Is the contract wrong, or is the service broken?
  2. Emergency Overrides: How should we handle urgent situations where a deployment must proceed despite a failed reliability check? How should overrides work? Who approves them? How do you prevent them from becoming the default?

Feel free to share thoughts, experiences, or suggestions on how we can make shift‑left reliability a practical reality.
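On the contract-drift question, one possible mechanization is to compare the declared target against the observed SLI over a long window and flag sustained divergence, leaving the "fix the contract vs. fix the service" call to humans. The function, tolerance, and thresholds below are illustrative assumptions:

```python
# Sketch of one possible drift detector: flag when observed reliability has
# sat well below the declared target. Names and thresholds are assumptions.

def detect_drift(target: float, observed: list[float], tolerance: float = 0.1) -> bool:
    """Flag drift when the average observed SLI sits below target minus tolerance."""
    average = sum(observed) / len(observed)
    return average < target - tolerance

# Spec says 99.95 % but reality has hovered around 99.5 % for months:
drifting = detect_drift(99.95, [99.5, 99.52, 99.48])
# drifting is True -> either renegotiate the contract or fix the service.
```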

The timing problem

  • Where do reliability decisions actually happen in your organization?
  • What would it look like to decide readiness before deployment?
  • What reliability rules do you wish you could enforce automatically?

The timing problem isn’t going away. The only question is whether you address it before deployment—or learn about it later in a postmortem.

NthLayer – open source, looking for early adopters

If you’re tired of reliability being an afterthought:

pip install nthlayer
nthlayer init
nthlayer check-deploy --service your-service

→ github.com/rsionnach/nthlayer

Star the repo, open an issue, or tell me I’m wrong. I want to hear how reliability decisions happen in your organization.


Rob Fox is a Senior Site Reliability Engineer focused on platform and reliability tooling. He’s exploring how reliability engineering can move earlier in the software delivery lifecycle. Find him on GitHub.
