Solved: Help us understand FinOps maturity & cloud cost challenges
Source: Dev.to
TL;DR: Cloud cost overruns stem from poor visibility and lack of ownership, exemplified by forgotten high-cost instances. The solution involves a multi-pronged FinOps approach, combining automated cleanup scripts, proactive policy-as-code guardrails, and fundamental organizational shifts toward showback and chargeback for sustained financial accountability.
Core Recommendations
- Implement "Janitor" scripts (e.g., AWS Lambda) to automatically identify and terminate untagged or abandoned cloud resources, a reactive cost-control measure.
- Enforce "Policy as Code" using tools like Sentinel, Open Policy Agent (OPA), or Service Control Policies (SCPs) to prevent expensive or untagged resource provisioning at the IaC or AWS Organization level.
- Drive Organizational Change through FinOps practices such as showback (displaying team-specific cloud spend) and chargeback (allocating costs to team budgets) to foster a culture of financial ownership.
The Problem
"Struggling with runaway cloud costs and immature FinOps practices? This guide, from a Senior DevOps Engineer, breaks down the real reasons for cloud waste and offers three concrete solutions, from quick scripts to permanent cultural shifts, to get your spending under control."
I still remember the Monday-morning Slack message from Finance:
"Darian, can you explain this AWS spike?"
Opening the billing console, my stomach dropped. A developer had spun up a p4d.24xlarge EC2 instance on Friday afternoon for a "quick test" of a new ML model and then forgot about it. Over a single weekend that instance generated a five-figure bill.
We had no guardrails, alerts, or ownership policies. It was a free-for-all, and we were paying for it, literally.
This isn't a unique story. Teams are handed the keys to the cloud kingdom with immense power to innovate, but without the financial literacy or guardrails to do it responsibly. That's the core of the FinOps maturity struggle. It's not about being cheap; it's about being efficient and accountable.
Root Causes
| Issue | Description |
|---|---|
| Lack of Visibility | Engineers can't see the cost of the infrastructure they're provisioning in real time. `terraform apply` doesn't show a price tag. Billing is an abstract concept dealt with weeks later. |
| Lack of Ownership | When no one is directly accountable for a resource (e.g., dev-test-data-processing-cluster-04), no one has an incentive to shut it down. It becomes "the company's infrastructure," a shared problem. |
Fixing this isn't just about finding zombie servers. It's about fundamentally changing how your teams interact with the cloud.
Solution #1: Reactive "Stop the Bleeding" (Janitor Scripts)
"This is the reactive, 'stop the bleeding' approach. You're not fixing the culture, but you are stopping the immediate waste."
We built a simple AWS Lambda function, triggered nightly by EventBridge, that:
- Scans all EC2 instances and RDS databases in our dev accounts.
- Flags resources missing an owner tag or a TTL (Time-To-Live) tag.
- Posts a warning to a Slack channel, tagging the creator (if identifiable via CloudTrail).
- If the resource remains untagged after 24 hours, a second Lambda terminates it.
Result: Harsh? Yes. Effective? Absolutely.
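The warn-then-terminate flow in the list above can be sketched as a small decision function, kept separate from any AWS calls. This is an illustration under assumptions, not the function we ran: the tag keys `owner` and `ttl`, the `flagged_at` timestamp (when the first warning was posted), and the action names are all hypothetical, and it treats a resource as untagged if either tag is missing.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed tag keys and grace period; adjust to your own tagging standard.
REQUIRED_TAGS = ("owner", "ttl")
GRACE_PERIOD = timedelta(hours=24)


def janitor_action(tags: dict, flagged_at: Optional[datetime], now: datetime) -> str:
    """Return 'keep', 'warn', 'wait', or 'terminate' for one resource."""
    missing = [key for key in REQUIRED_TAGS if key not in tags]
    if not missing:
        return "keep"       # properly tagged, nothing to do
    if flagged_at is None:
        return "warn"       # first sighting: post the Slack warning
    if now - flagged_at >= GRACE_PERIOD:
        return "terminate"  # still untagged after the grace period
    return "wait"           # already warned, grace period still running


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
print(janitor_action({"owner": "darian"}, None, now))  # prints: warn ('ttl' is missing)
```

Keeping the decision pure makes it easy to unit-test the grace-period logic before letting a Lambda terminate anything.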
Sample Python (Boto3): Lambda Janitor

    import boto3

    def find_untagged_instances(event, context):
        """Lambda handler: report running EC2 instances that lack an 'owner' tag."""
        ec2 = boto3.client('ec2', region_name='us-east-1')
        # Paginate so accounts with many instances are scanned completely.
        paginator = ec2.get_paginator('describe_instances')
        pages = paginator.paginate(
            Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
        )
        for page in pages:
            for reservation in page['Reservations']:
                for instance in reservation['Instances']:
                    instance_id = instance['InstanceId']
                    # 'Tags' is absent when an instance has no tags at all.
                    tag_keys = [tag['Key'] for tag in instance.get('Tags', [])]
                    if 'owner' not in tag_keys:
                        print(f"ALERT: Instance {instance_id} is missing 'owner' tag.")
                        # In a real script, you'd post this to Slack or SNS
                        # and maybe add a "pending_termination" tag

Warning: This is a hack, not a strategy. It cleans up the mess but doesn't teach anyone not to make one. You'll spend time maintaining the script and dealing with angry developers whose "important test server" got terminated. Use it to gain initial control, but don't stop here.
Solution #2: "Shift-Left" Prevention (Policy-as-Code)
"This is where you 'shift left' and prevent the problem from happening in the first place. Instead of cleaning up messes, you make it impossible to create them."
Core Principle
Embed cost controls directly into your IaC pipeline and cloud account structure.
Mandatory Tagging with IaC Policies
- Tools: Sentinel (Terraform Cloud), OPA integrated into CI/CD.
- Policy Example: Fail a `terraform plan` if a resource lacks an `owner` tag or if an S3 bucket lacks a lifecycle policy.
- Outcome: Developers receive immediate feedback before anything is deployed.
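The tagging rule can be prototyped before committing to Sentinel or OPA. The sketch below is a plain-Python stand-in (not Sentinel or Rego) that inspects the JSON produced by `terraform show -json` on a saved plan and lists resources being created without an `owner` tag. It assumes tags appear under `change.after.tags` in each `resource_changes` entry; provider-level default tags and values unknown until apply are not handled, and the function name is hypothetical.

```python
REQUIRED_TAG = "owner"


def find_untagged(plan: dict, required_tag: str = REQUIRED_TAG) -> list:
    """Return addresses of resources being created without the required tag."""
    violations = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        if "create" not in change.get("actions", []):
            continue  # only check resources this plan would create
        after = change.get("after") or {}
        if "tags" not in after:
            continue  # resource type exposes no tags attribute
        tags = after["tags"] or {}
        if required_tag not in tags:
            violations.append(rc["address"])
    return violations


# Hand-written fragment in the shape of a plan JSON document (illustrative only).
sample_plan = {
    "resource_changes": [
        {"address": "aws_instance.tagged",
         "change": {"actions": ["create"], "after": {"tags": {"owner": "ml-team"}}}},
        {"address": "aws_instance.untagged",
         "change": {"actions": ["create"], "after": {"tags": None}}},
    ]
}
print(find_untagged(sample_plan))  # prints: ['aws_instance.untagged']
```

In a CI job, a non-empty result would fail the pipeline step, which gives developers the same pre-deploy feedback a Sentinel or OPA policy would.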
Service Control Policies (SCPs)
Apply SCPs at the AWS Organization level to developer accounts. SCPs act as "IAM policies on steroids," allowing you to deny the creation of specific instance families.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "DenyExpensiveInstanceTypesInDev",
          "Effect": "Deny",
          "Action": "ec2:RunInstances",
          "Resource": "*",
          "Condition": {
            "StringEquals": {
              "ec2:InstanceType": [
                "p4d.24xlarge",
                "g5.12xlarge",
                "g5.24xlarge"
              ]
            }
          }
        }
      ]
    }

- Use-Case: Block all `p4`, `g5`, etc., instance types in any account that isn't in the designated "ML Research" OU.
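One thing to check before rolling out a deny list is which instance types it actually covers. An exact-match (`StringEquals`) list only blocks the sizes it names, while wildcard patterns cover a whole family. The sketch below approximates both locally; it uses Python's `fnmatchcase` as a stand-in for IAM wildcard matching (an approximation, not AWS's evaluator), and the pattern lists and function names are hypothetical.

```python
from fnmatch import fnmatchcase

# Exact sizes, as in a StringEquals condition.
EXACT_DENY = ["p4d.24xlarge", "g5.12xlarge", "g5.24xlarge"]
# Family-wide patterns, as one might write for a wildcard condition (assumed).
WILDCARD_DENY = ["p4d.*", "g5.*"]


def denied_exact(instance_type: str, deny_list=EXACT_DENY) -> bool:
    """True only if the type is named verbatim in the list."""
    return instance_type in deny_list


def denied_wildcard(instance_type: str, patterns=WILDCARD_DENY) -> bool:
    """True if the type matches any case-sensitive wildcard pattern."""
    return any(fnmatchcase(instance_type, p) for p in patterns)


for itype in ["g5.12xlarge", "g5.48xlarge", "t3.micro"]:
    print(itype, denied_exact(itype), denied_wildcard(itype))
```

Here `g5.48xlarge` slips past the exact list but is caught by the `g5.*` pattern, which is the gap to look for when a policy is meant to block a whole family.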
Solution #3: Organizational Change (FinOps Practices)
"Drive 'Organizational Change' through FinOps practices like 'showback' and 'chargeback' to foster a culture of financial ownership."
Steps to Implement
- Showback: Publish weekly/monthly dashboards that break down cloud spend by team, project, or tag.
- Chargeback: Allocate actual costs to each team's budget, making overspend a direct responsibility.
- FinOps Council: Form a cross-functional group (Engineering, Finance, Product) to review spend, set budgets, and refine policies.
- Education & Training: Run regular workshops on cloud pricing, cost-effective architecture patterns, and tagging standards.
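The showback and chargeback steps above boil down to grouping spend by a tag and comparing it to a budget. A minimal sketch follows, assuming cost line items have already been exported into dictionaries with a `cost_usd` field and a `tags` map; that record shape, the `team` tag key, and the budget figures are all hypothetical.

```python
from collections import defaultdict

UNALLOCATED = "unallocated"


def showback(line_items: list, tag_key: str = "team") -> dict:
    """Sum cost per team; untagged spend lands in an 'unallocated' bucket."""
    totals = defaultdict(float)
    for item in line_items:
        team = item.get("tags", {}).get(tag_key, UNALLOCATED)
        totals[team] += item["cost_usd"]
    return dict(totals)


def over_budget(totals: dict, budgets: dict) -> dict:
    """Chargeback view: amount by which each team exceeds its budget."""
    return {
        team: round(spend - budgets[team], 2)
        for team, spend in totals.items()
        if team in budgets and spend > budgets[team]
    }


items = [
    {"cost_usd": 1200.0, "tags": {"team": "ml"}},
    {"cost_usd": 300.0, "tags": {"team": "web"}},
    {"cost_usd": 150.0, "tags": {}},  # untagged resource
]
totals = showback(items)
print(totals)                                             # per-team spend
print(over_budget(totals, {"ml": 1000.0, "web": 500.0}))  # teams over budget
```

The size of the unallocated bucket is itself a useful maturity signal: the more spend lands there, the less the dashboards can tell any team about its own costs.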
Expected Benefits
| Benefit | Description |
|---|---|
| Transparency | Teams see the financial impact of their decisions in near real time. |
| Accountability | Ownership is assigned; teams are incentivized to optimize. |
| Continuous Improvement | Regular reviews surface new waste patterns and drive policy updates. |
Putting It All Together
| Phase | Action | Owner |
|---|---|---|
| 1. Reactive | Deploy Lambda Janitor + Slack alerts. | Cloud Ops / SRE |
| 2. Preventive | Implement Sentinel/OPA policies & SCPs. | Platform Engineering |
| 3. Cultural | Roll out showback/chargeback dashboards, form FinOps council, run training. | Finance + Engineering Leadership |
Bottom line: Start with the quick win (janitor script) to halt immediate waste, then lock down provisioning with policy-as-code, and finally embed financial responsibility into the organization's DNA. This three-layered approach moves you from "fire-fighting" to "financially smart cloud engineering."
A broader variant of the SCP uses StringLike wildcards to deny whole instance families:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Deny",
          "Action": "ec2:RunInstances",
          "Resource": "arn:aws:ec2:*:*:instance/*",
          "Condition": {
            "StringLike": {
              "ec2:InstanceType": [
                "p4d.*",
                "p3.*",
                "g5.*",
                "x2iezn.*",
                "u-12tb1.metal"
              ]
            }
          }
        }
      ]
    }

Approaches Overview
| Approach | Effort | Time to Implement | Long-Term Impact |
|---|---|---|---|
| 1. The Janitor Script | Low | Days | Low (Reactive) |
| 2. Policy & Guardrails | Medium | Weeks | High (Proactive) |
| 3. Organizational Change | High | Months/Quarters | Transformational |
Ultimately, a mature FinOps practice uses a combination of all three:
- Janitor script for what slips through.
- Guardrails to prevent most issues.
- Cultural ownership to make everyone a responsible steward of cloud resources.
Stop chasing surprise bills and start building a platform that makes financial responsibility the path of least resistance.
Read the original article on TechResolve.blog