Solved: When do you decide to stop a PPC campaign?
Source: Dev.to
🚀 Executive Summary
TL;DR: Unidentified, costly “zombie microservices” (the infrastructure equivalent of a PPC campaign burning cash with no measurable ROI) quietly consume significant cloud resources. Because their dependencies are unknown, teams fear shutting them down and keep paying the bill. Safely decommission these services using methods like:
- Gradual resource reduction – “Strangle and Observe”
- Thorough dependency mapping – “Archaeological Dig”
- Controlled, reversible “Scream Test” during low‑traffic periods
🎯 Key Takeaways
- Strangle and Observe – Cautiously reduce a service’s allocated resources (e.g., scale down EC2 instances or decrease cron frequency) and monitor system reactions and alerts to surface hidden dependencies while minimizing immediate risk.
- Archaeological Dig – Use observability tools (DataDog, VPC flow logs, etc.) to map ingress/egress traffic and business functions, then create a formal deprecation plan and remove the associated infrastructure‑as‑code.
- Scream Test – For services with zero documentation or logs, run a controlled test first in staging, then in production during low‑traffic windows with a ready rollback plan to identify critical dependencies by observing direct system failures.
Struggling with “zombie” services and legacy processes racking up your cloud bill? Learn when and how to safely decommission infrastructure without causing a production outage.
My ‘PPC Campaign’ is a Zombie Microservice: When to Pull the Plug
I remember staring at the monthly cloud bill. It was a five‑figure number that made my stomach turn, and one line item stood out: a fleet of massive EC2 instances under a service named DataAggregator-PROD. They were costing us nearly $4,000 a month, just humming along.
I asked around. The new product manager had never heard of it. The junior devs thought it was “some legacy thing we don’t touch.” It was a ghost in the machine, a technical PPC campaign burning cash with zero measurable ROI.
The problem? No one knew for sure what would happen if we turned it off. This is a story I’ve seen play out at nearly every company I’ve worked for.
The “Why”: How We Create These Digital Ghosts
This isn’t about blaming people. It’s a natural consequence of growth, changing priorities, and team turnover. A project that was critical two years ago gets superseded. The original developers move on. The documentation—if it ever existed—is now a dead link in a forgotten Confluence space. We end up with these zombie services for a few key reasons:
| Reason | Explanation |
|---|---|
| Fear of the Unknown | “What if this service quietly powers the checkout page and we cause a million‑dollar outage?” It’s easier to keep paying the bill than to risk being the one who broke production. |
| Lack of Ownership | When a service belongs to everyone, it belongs to no one. Without a clear owner responsible for its lifecycle, it’s destined to become technical debt. |
| Poor Observability | If you can’t easily see what’s calling a service and what that service is calling, you’re flying blind. You can’t confidently decommission something you can’t fully understand. |
So you’re stuck with an expensive, mysterious process. You know it’s probably useless, but the risk of shutting it down feels too high. Let’s walk through how we, in the trenches, actually solve this.
The Fixes: From Cautious Tweak to Calculated Gamble
1. The Quick Fix – ‘Strangle and Observe’ Method
This is my go‑to first step when political capital or time for a full investigation is low. It’s a bit hacky, but it’s effective. You don’t kill the service; you starve it. The goal is to make it cheap and see who screams.
- Auto‑scaling group – Scale the desired/min/max count down to one instance, using the smallest instance type possible.
- Data pipeline – Change its cron schedule from every hour to once a day at 3 AM.
The service is still “running,” which satisfies nervous stakeholders, but your costs plummet. Now, watch your monitoring dashboards like a hawk. Look for new error spikes in upstream or downstream services, check support ticket queues, and listen for whispers of “Hey, is the XYZ report running slow?”
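In Terraform terms, the strangle can be as small as the sketch below: a minimal, hypothetical example assuming the zombie runs behind an auto‑scaling group and an EventBridge cron rule. Every resource name, variable, and “was” value here is invented for illustration.

```hcl
# Strangle step 1: pin the fleet to a single, small instance.
resource "aws_launch_template" "data_aggregator" {
  name_prefix   = "data-aggregator-"
  image_id      = var.legacy_ami_id  # hypothetical variable
  instance_type = "t3.micro"         # was m5.4xlarge
}

resource "aws_autoscaling_group" "data_aggregator" {
  name                = "DataAggregator-PROD"
  min_size            = 1            # was 4
  max_size            = 1            # was 12
  desired_capacity    = 1            # was 8
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = aws_launch_template.data_aggregator.id
    version = "$Latest"
  }
}

# Strangle step 2: slow the data pipeline from hourly to 3 AM daily.
resource "aws_cloudwatch_event_rule" "aggregator_schedule" {
  name                = "data-aggregator-schedule"
  schedule_expression = "cron(0 3 * * ? *)"  # was rate(1 hour)
}
```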
Pro Tip: Before you do this, make sure your alerting is top‑notch. If legacy‑api‑gw‑01 starts throwing 503 errors because its tiny instance is overwhelmed, you need to know immediately—not a day later when a customer complains.
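To make that alert concrete, here is one hedged way to wire it up, assuming the legacy service sits behind an Application Load Balancer; the alarm name, ALB suffix variable, and SNS topic are hypothetical placeholders.

```hcl
# Page the on-call immediately if the starved service starts failing.
resource "aws_cloudwatch_metric_alarm" "legacy_5xx_spike" {
  alarm_name          = "legacy-api-gw-01-5xx-spike"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  statistic           = "Sum"
  period              = 300  # 5-minute windows
  evaluation_periods  = 1
  threshold           = 10   # more than 10 errors in 5 minutes
  comparison_operator = "GreaterThanThreshold"

  dimensions = {
    LoadBalancer = var.legacy_alb_arn_suffix  # hypothetical
  }

  alarm_actions = [var.oncall_sns_topic_arn]  # hypothetical
}
```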
2. The Permanent Fix – ‘Archaeological Dig’
This is the “right” way to do it. It takes time and effort, but it retires the service with minimal risk and cleans up the technical debt properly. You become a detective, tracing the service’s digital footprint.
Your best friends here are your observability tools—think DataDog, New Relic, Honeycomb, or even deep‑diving into VPC flow logs and CloudWatch metrics. You need to answer three questions:
- Who calls this service? (Ingress traffic)
- What does this service call? (Egress traffic)
- What business function does it perform? (The “so what?”)
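If flow logs aren’t already enabled, the first two questions are hard to answer. Here is a minimal Terraform sketch for capturing VPC flow logs to CloudWatch; the VPC ID, IAM role, and log group name are hypothetical, and your retention window should be long enough to catch infrequent consumers (a monthly report only screams once a month).

```hcl
# Capture all traffic in the legacy VPC so we can see who talks
# to the zombie's network interfaces.
resource "aws_cloudwatch_log_group" "flow_logs" {
  name              = "/vpc/data-aggregator-flow-logs"
  retention_in_days = 30  # long enough to catch monthly jobs
}

resource "aws_flow_log" "data_aggregator" {
  vpc_id               = var.legacy_vpc_id      # hypothetical
  traffic_type         = "ALL"                  # accepted and rejected
  log_destination_type = "cloud-watch-logs"
  log_destination      = aws_cloudwatch_log_group.flow_logs.arn
  iam_role_arn         = var.flow_log_role_arn  # hypothetical
}
```

Once the logs accumulate, a CloudWatch Logs Insights query grouped by source address gives you the ingress side of the map, and one grouped by destination address gives you the egress side.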
You’ll:
- Build a dependency map.
- Draft a formal deprecation plan.
- Communicate the plan to every team whose services interact with it.
- Schedule a decommission window and execute the removal of the associated infrastructure‑as‑code.
3. The Calculated Gamble – ‘Scream Test’
Let’s be honest. Sometimes you have zero documentation, zero logs, and zero time. The service is an opaque box, and the Archaeological Dig would take months. In these rare cases, you can perform a controlled “scream test.”
This is the nuclear option, but it is not a cowboy move. It is a calculated risk.
- Staging environment – Shut the service down and leave it off for a full sprint. If the QA team doesn’t notice anything, you have your first piece of evidence.
- Production – Plan it like a surgical strike:
  - Announce a maintenance window during a low‑traffic period (e.g., Saturday at 2 AM).
  - Have a rollback plan ready: a single command or button click to bring it back online.
  - Shut the service down and then wait.
  - If nothing happens after an hour, you can be reasonably confident.
  - If nothing happens after a week, you can be very confident.
  - If the BI team calls you three weeks later because their quarterly report failed, you have your answer. You can bring it back online temporarily and transition to the Permanent Fix method, now armed with a known dependency.
Warning: Use this option sparingly. It can burn trust if it goes wrong. But sometimes, it’s the only way to make progress on deeply‑entrenched technical debt and finally stop paying for ghosts.
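A convenient way to get that single‑command rollback is a kill switch in your infrastructure‑as‑code. Below is a minimal, hypothetical Terraform sketch; the variable name is invented, and it assumes the service is stateless (or keeps its data off‑instance), since the toggle destroys and recreates the resources.

```hcl
# Kill switch for the scream test. Flipping this back to true
# is the entire rollback plan.
variable "data_aggregator_enabled" {
  description = "Set to false during the scream test window."
  type        = bool
  default     = true
}

module "legacy_data_aggregator" {
  source = "./modules/ec2-cluster"
  count  = var.data_aggregator_enabled ? 1 : 0
  # ... existing module arguments ...
}
```

Start the test with terraform apply -var="data_aggregator_enabled=false"; if someone screams, re‑apply with the default and the service comes back.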
Example Terraform Plan (The Goal)
```hcl
# module "legacy_data_aggregator" {
#   source = "./modules/ec2-cluster"
#   ...
# }
# The above module will be removed in release v3.45.0 on 08/15.
# Ticket: DEVOPS-1234
# Reason: Service has been superseded by the 'realtime-metrics-api'.
# Contact: #devops-team on Slack
```
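When the decommission window arrives, terraform plan for a change like this should report only destroys for the zombie’s resources; anything else in the output is a dependency you missed.

Closing Thoughts
Zombie microservices are the hidden cost of rapid growth. By applying a tiered approach, starting with a low‑risk “strangle,” moving to a thorough “archaeological dig,” and, when necessary, running a controlled “scream test,” you can safely retire these cost centers without jeopardizing production.
Takeaway: Don’t let fear keep you paying for dead weight. Use observability, incremental reduction, and disciplined rollback plans to turn ghost services into clean, cost‑effective architecture. 🚀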

👉 Read the original article on TechResolve.blog