Solved: Help us understand FinOps maturity & cloud cost challenges
Source: Dev.to
TL;DR: Cloud cost overruns stem from poor visibility and lack of ownership, exemplified by forgotten high‑cost instances. The solution involves a multi‑pronged FinOps approach, combining automated cleanup scripts, proactive policy‑as‑code guardrails, and fundamental organizational shifts toward showback and chargeback for sustained financial accountability.
Core Recommendations
- Implement “Janitor” scripts (e.g., AWS Lambda) to automatically identify and terminate untagged or abandoned cloud resources – a reactive cost‑control measure.
- Enforce “Policy as Code” using tools like Sentinel, Open Policy Agent (OPA), or Service Control Policies (SCPs) to prevent expensive or untagged resource provisioning at the IaC or AWS Organization level.
- Drive Organizational Change through FinOps practices such as showback (displaying team‑specific cloud spend) and chargeback (allocating costs to team budgets) to foster a culture of financial ownership.
The Problem
“Struggling with runaway cloud costs and immature FinOps practices? This guide, from a Senior DevOps Engineer, breaks down the real reasons for cloud waste and offers three concrete solutions, from quick scripts to permanent cultural shifts, to get your spending under control.”
I still remember the Monday‑morning Slack message from Finance:
“Darian, can you explain this AWS spike?”
Opening the billing console, my stomach dropped. A developer had spun up a p4d.24xlarge EC2 instance on Friday afternoon for a “quick test” of a new ML model and then forgot about it. Over a single weekend that instance generated a five‑figure bill.
We had no guardrails, alerts, or ownership policies. It was a free‑for‑all, and we were paying for it—literally.
This isn’t a unique story. Teams are handed the keys to the cloud kingdom with immense power to innovate, but without the financial literacy or guardrails to do it responsibly. That’s the core of the FinOps maturity struggle. It’s not about being cheap; it’s about being efficient and accountable.
Root Causes
| Issue | Description |
|---|---|
| Lack of Visibility | Engineers can’t see the cost of the infrastructure they’re provisioning in real time. terraform apply doesn’t show a price tag. The bill is an abstraction that someone else deals with weeks later. |
| Lack of Ownership | When no one is directly accountable for a resource (e.g., dev-test-data-processing-cluster-04), no one has an incentive to shut it down. It becomes “the company’s infrastructure,” a shared problem. |
Fixing this isn’t just about finding zombie servers. It’s about fundamentally changing how your teams interact with the cloud.
Solution #1 – Reactive “Stop the Bleeding” (Janitor Scripts)
“This is the reactive, ‘stop the bleeding’ approach. You’re not fixing the culture, but you are stopping the immediate waste.”
We built a simple AWS Lambda function, triggered nightly by EventBridge, that:
- Scans all EC2 instances and RDS databases in our dev accounts.
- Flags resources missing an owner tag or a TTL (Time‑To‑Live) tag.
- Posts a warning to a Slack channel, tagging the creator (if identifiable via CloudTrail).
- If the resource remains untagged after 24 hours, a second Lambda terminates it.
Result: Harsh? Yes. Effective? Absolutely.
Sample Python (Boto3) – Lambda Janitor
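import boto3


def find_untagged_instances(event, context):
    """Lambda handler: report running EC2 instances that lack an 'owner' tag."""
    ec2 = boto3.client('ec2', region_name='us-east-1')
    # describe_instances is paginated; walk every page so large accounts are fully covered
    paginator = ec2.get_paginator('describe_instances')
    pages = paginator.paginate(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    )
    untagged = []
    for page in pages:
        for reservation in page['Reservations']:
            for instance in reservation['Instances']:
                instance_id = instance['InstanceId']
                # 'Tags' is absent when an instance has no tags at all
                tag_keys = [tag['Key'] for tag in instance.get('Tags', [])]
                if 'owner' not in tag_keys:
                    print(f"ALERT: Instance {instance_id} is missing 'owner' tag.")
                    untagged.append(instance_id)
                    # In a real script, you'd post this to Slack or SNS
                    # and maybe add a "pending_termination" tag
    return untagged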
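The sample covers only the scan step. The 24-hour grace period described earlier could be sketched as a small decision function, kept free of AWS calls so it is easy to test. This is an illustration, not the author's Lambda: the `ttl` tag key, the `pending_termination` marker tag, and the idea of storing the first-warning time in that tag as an ISO 8601 timestamp are all assumptions.

```python
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(hours=24)
REQUIRED_TAGS = {"owner", "ttl"}      # assumed tag keys
PENDING_TAG = "pending_termination"   # hypothetical marker tag holding an ISO timestamp


def decide_action(tags, now):
    """Return 'ok', 'warn', 'wait', or 'terminate' for one resource's tag dict."""
    if REQUIRED_TAGS.issubset(tags):
        return "ok"          # properly tagged, nothing to do
    flagged_at = tags.get(PENDING_TAG)
    if flagged_at is None:
        return "warn"        # first sighting: alert and stamp the marker tag
    deadline = datetime.fromisoformat(flagged_at) + GRACE_PERIOD
    # Already warned: terminate only once the grace period has elapsed
    return "terminate" if now >= deadline else "wait"
```

The nightly run would call `decide_action(tags, datetime.now(timezone.utc))` for each resource and leave the actual Slack post, tag write, and termination to the caller.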
Warning: This is a hack, not a strategy. It cleans up the mess but doesn’t teach anyone not to make one. You’ll spend time maintaining the script and dealing with angry developers whose “important test server” got terminated. Use it to gain initial control, but don’t stop here.
Solution #2 – “Shift‑Left” Prevention (Policy‑as‑Code)
“This is where you ‘shift left’ and prevent the problem from happening in the first place. Instead of cleaning up messes, you make it impossible to create them.”
Core Principle
Embed cost controls directly into your IaC pipeline and cloud account structure.
Mandatory Tagging with IaC Policies
- Tools: Sentinel (Terraform Cloud), OPA integrated into CI/CD.
- Policy Example: Fail a terraform plan if a resource lacks an owner tag or if an S3 bucket lacks a lifecycle policy.
- Outcome: Developers receive immediate feedback before anything is deployed.
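Sentinel and OPA policies are written in their own languages, but the idea behind the tagging check can be sketched in Python against the JSON form of a plan (the output of `terraform show -json` on a saved plan file). The function name is hypothetical, and the sketch assumes that taggable resources expose a `tags` attribute in the plan's `after` values; it inspects only that attribute.

```python
def missing_owner_tag(plan):
    """Return addresses of resources being created or updated without an 'owner' tag."""
    offenders = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        actions = change.get("actions", [])
        if "create" not in actions and "update" not in actions:
            continue  # deletes and no-ops cannot introduce untagged resources
        after = change.get("after") or {}
        if "tags" not in after:
            continue  # assumption: no 'tags' attribute means the type is not taggable
        tags = after.get("tags") or {}  # an unset tags attribute appears as null
        if "owner" not in tags:
            offenders.append(rc["address"])
    return offenders
```

A CI step could load the plan JSON, call `missing_owner_tag`, and exit non-zero when the returned list is not empty, which fails the pipeline before anything is applied.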
Service Control Policies (SCPs)
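Apply SCPs at the AWS Organization level to developer accounts. SCPs act as “IAM policies on steroids”: they grant nothing themselves but cap what IAM users and roles in the attached accounts can do, so you can deny the launch of specific instance types or entire families.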
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyExpensiveInstanceTypesInDev",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "ec2:InstanceType": [
            "p4d.24xlarge",
            "g5.12xlarge",
            "g5.24xlarge"
          ]
        }
      }
    }
  ]
}
- Use‑Case: Block all p4, g5, etc., instance types in any account that isn’t in the designated “ML Research” OU.
Solution #3 – Organizational Change (FinOps Practices)
“Drive ‘Organizational Change’ through FinOps practices like ‘showback’ and ‘chargeback’ to foster a culture of financial ownership.”
Steps to Implement
- Showback – Publish weekly/monthly dashboards that break down cloud spend by team, project, or tag.
- Chargeback – Allocate actual costs to each team’s budget, making overspend a direct responsibility.
- FinOps Council – Form a cross‑functional group (Engineering, Finance, Product) to review spend, set budgets, and refine policies.
- Education & Training – Run regular workshops on cloud pricing, cost‑effective architecture patterns, and tagging standards.
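The showback step boils down to grouping spend by an ownership tag. A minimal sketch follows, assuming cost records have already been exported from a billing tool into dictionaries; the `cost_usd` and `tags` field names and the `team` tag key are hypothetical, not a real export schema.

```python
from collections import defaultdict
from decimal import Decimal


def showback(records, tag_key="team"):
    """Sum cost per team, largest spender first; untagged spend gets its own bucket."""
    totals = defaultdict(Decimal)  # Decimal() starts each team at 0
    for rec in records:
        team = rec.get("tags", {}).get(tag_key, "untagged")
        # Parse costs from strings so money is never held as a binary float
        totals[team] += Decimal(rec["cost_usd"])
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```

Keeping an explicit "untagged" bucket makes the ownership gap visible on the dashboard itself, which ties the showback report back to the tagging policies.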
Expected Benefits
| Benefit | Description |
|---|---|
| Transparency | Teams see the financial impact of their decisions in near‑real‑time. |
| Accountability | Ownership is assigned; teams are incentivized to optimize. |
| Continuous Improvement | Regular reviews surface new waste patterns and drive policy updates. |
Putting It All Together
| Phase | Action | Owner |
|---|---|---|
| 1️⃣ Reactive | Deploy Lambda Janitor + Slack alerts. | Cloud Ops / SRE |
| 2️⃣ Preventive | Implement Sentinel/OPA policies & SCPs. | Platform Engineering |
| 3️⃣ Cultural | Roll out showback/chargeback dashboards, form FinOps council, run training. | Finance + Engineering Leadership |
Bottom line: Start with the quick win (janitor script) to halt immediate waste, then lock down provisioning with policy‑as‑code, and finally embed financial responsibility into the organization’s DNA. This three‑layered approach moves you from “fire‑fighting” to “financially‑smart cloud engineering.”
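For reference, a broader variant of the Solution #2 SCP uses StringLike with wildcards to deny whole instance families rather than individual sizes:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringLike": {
          "ec2:InstanceType": [
            "p4d.*",
            "p3.*",
            "g5.*",
            "x2iezn.*",
            "u-12tb1.metal"
          ]
        }
      }
    }
  ]
}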
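Wildcard patterns like these are easy to get subtly wrong, so it can help to check locally which instance types a pattern list would catch before attaching the policy. The sketch below approximates StringLike matching with Python's `fnmatchcase`; this is an approximation rather than AWS's evaluator (`fnmatch` also gives `[...]` a special meaning that IAM does not), and `is_denied` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase

DENIED_PATTERNS = ["p4d.*", "p3.*", "g5.*", "x2iezn.*", "u-12tb1.metal"]


def is_denied(instance_type, patterns=DENIED_PATTERNS):
    """True if the instance type matches any pattern (case-sensitive, * and ? wildcards)."""
    return any(fnmatchcase(instance_type, p) for p in patterns)


for candidate in ["p4d.24xlarge", "g5.xlarge", "p3dn.24xlarge", "g5g.xlarge", "t3.micro"]:
    print(candidate, "denied" if is_denied(candidate) else "allowed")
```

Under this approximation, sibling families such as `p3dn` and `g5g` are not matched by `p3.*` or `g5.*`, so each family you want blocked needs its own pattern.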
Approaches Overview
| Approach | Effort | Time to Implement | Long‑Term Impact |
|---|---|---|---|
| 1. The Janitor Script | Low | Days | Low (Reactive) |
| 2. Policy & Guardrails | Medium | Weeks | High (Proactive) |
| 3. Organizational Change | High | Months/Quarters | Transformational |
Ultimately, a mature FinOps practice uses a combination of all three:
- Janitor script for what slips through.
- Guardrails to prevent most issues.
- Cultural ownership to make everyone a responsible steward of cloud resources.
Stop chasing surprise bills and start building a platform that makes financial responsibility the path of least resistance.
👉 Read the original article on TechResolve.blog