LGTM != Production Ready: Why your CI pipeline is missing the most important step

Published: February 4, 2026 at 05:18 PM EST
4 min read
Source: Dev.to

Overview

We have linters for syntax and scanners for security. It’s time we started linting for “will this wake me up at 3 AM?”

You submit a Pull Request. The CI passes green. Your colleagues review it and comment “LGTM!” The code is merged and deployed.

Three days later, at 3:17 AM, PagerDuty fires.

The root cause wasn’t a syntax error. It wasn’t a logic bug that unit tests could catch. It was something subtle: an HTTP client missing a timeout setting that caused a cascading failure when a downstream service hiccuped.

The “Senior Intuition” Gap

Why did the PR review miss that missing timeout?

  • Standard code reviews usually focus on code style, logic correctness, and maintainability.
  • Standard static‑analysis tools catch syntax errors or obvious security flaws.

Nobody is actively grepping for operational maturity. Usually, catching these latent failure modes relies on “senior engineer intuition”—the Spidey‑Sense that a battle‑hardened SRE develops after years of being woken up on‑call. They glance at code and immediately think: “Where is the back‑off retry logic?” or “This unchecked environment variable is going to brick boot‑up one day.”

The problem is that senior intuition doesn’t scale. You can’t clone your best SRE to review every PR.

The Failure of Checklists

Many organizations try to solve this with a “Production Readiness Checklist” — a dusty Confluence page with dozens of questions like “Have you considered failure domains?”

Let’s be honest: nobody uses these. They are manual, tedious, done at the very end of the development lifecycle (too late to change architecture), and are usually just “checkbox theatre” to appease management.

If it’s not automated, it doesn’t exist.

Shifting “Operability” Left

We need to treat operational requirements the same way we treat code style. If gofmt fails, you can’t merge. If your code contains latent operational risks, you shouldn’t be able to merge either.

I wanted a tool that codified that “senior engineer intuition” into something executable—a scanner that doesn’t care about variable names, but cares deeply about whether the application will survive a partial network outage.

Since I couldn’t find one that fit my needs, I built it.

Introducing Production‑Readiness

Production‑Readiness is an open‑source, opinionated scanner designed to detect operational blind spots before they hit production.

  • It’s not a replacement for Prometheus or Datadog.
  • It’s a pre‑flight check that looks for patterns that appear “correct” in code but cause fires in production.

What Does It Actually Catch?

Unlike a standard linter, this tool looks for semantic operational patterns. Here are a few examples of rules that are hard to catch with a simple regex grep:

1. The “Hanging Client” Trap

  • Risk: Instantiating an HTTP client with no timeout (in many languages, the default is infinite) means a stalled downstream ties up your threads/goroutines, eventually crashing the service.
  • Fix: The scanner flags clients initialized without explicit timeouts.

2. Missing Graceful Shutdown

  • Risk: When Kubernetes scales down a pod, it sends a SIGTERM. If the app doesn’t handle it and finish in‑flight requests, you drop user traffic during every deployment or auto‑scale event.
  • Fix: The scanner looks for signal‑handling wiring.

3. Unvalidated Configuration

  • Risk: The app needs an API_KEY environment variable to start. You deploy, and it crash‑loops because the variable is missing in the new environment.
  • Fix: The scanner checks if critical configuration inputs have validation checks associated with them.

Seeing It in Action

The tool is written in Go and packaged as a single binary you can drop into any CI pipeline.

$ pr scan ./my-microservice

It produces a prioritized report of operational risks.

Production‑Readiness scan example
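Wiring it into a pipeline could look something like this hypothetical GitHub Actions step (the exit-code behaviour on findings is my assumption; check the project's docs for the actual contract):

```yaml
# Hypothetical CI step: assumes `pr scan` exits non-zero on blocking findings.
- name: Production readiness gate
  run: pr scan ./my-microservice
```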

Conclusion

The gap between “code that works” and “code that runs reliably in production” is massive. We need to stop relying on hope and manual checklists to bridge that gap.

By automating the detection of operational anti‑patterns, we can ship faster and, more importantly, sleep better.

The project is open source, and we are just getting started on defining the rules for what makes software “production‑ready.”

If you’ve ever been burned by a “silly” configuration mistake in prod, give it a spin.

👉 Star the repo on GitHub
