Building a Cost-Effective AutoML Platform on AWS: $10-25/month vs $150+ for SageMaker Endpoints
Source: Dev.to
TL;DR
I built a serverless AutoML platform that trains ML models for ≈ $10‑25 / month. Upload a CSV, select the target column, and get a trained model back—no ML expertise required.
Prerequisites
- AWS account with admin access
- AWS CLI v2 (configured with `aws configure`)
- Terraform ≥ 1.5
- Docker (running)
- Node.js 18+ and pnpm (frontend)
- Python 3.11+ (local development)
Deployment
Estimated time: ~15 minutes from clone to a working platform.
```bash
git clone https://github.com/cristofima/AWS-AutoML-Lite.git
cd AWS-AutoML-Lite/infrastructure/terraform
terraform init && terraform apply
```
Why a Custom Solution?
AWS SageMaker Autopilot is powerful but costly for prototyping.
- Free tier: 50 h/month of training for the first 2 months.
- Real‑time inference endpoints (e.g., `ml.c5.xlarge`) cost ~$150 / month when kept running 24/7.
The goal was a cheaper, serverless alternative for side projects.
Goals
| Goal | Desired outcome |
|---|---|
| Upload CSV → Trained model | .pkl file |
| Auto‑detect problem type | Classification vs. regression |
| Automatic EDA reports | Data profiling out‑of‑the‑box |
| Cost | Under $25 / month |
Architecture Overview
| Component | Technology | Reason |
|---|---|---|
| Backend API | FastAPI + Mangum | Async, auto‑docs, Lambda‑ready |
| Training | FLAML + scikit‑learn | Fast AutoML, production‑ready |
| Frontend | Next.js 16 + Tailwind | SSR support via Amplify |
| Infrastructure | Terraform | Reproducible, multi‑env |
| CI/CD | GitHub Actions + OIDC | No stored AWS credentials |
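Mangum adapts the FastAPI app to Lambda's event/response contract. Conceptually, the handler that Lambda invokes has the shape below (a stdlib-only sketch, not the project's actual code; the real backend simply does `handler = Mangum(app)`):

```python
import json

def handler(event, context):
    """Shape of a Lambda proxy handler. In the real backend this object
    is produced by Mangum wrapping the FastAPI app."""
    body = json.loads(event.get("body") or "{}")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"received": body}),
    }
```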
Problem‑type detection (auto)
The training job inspects the target column and chooses classification or regression automatically.

**Note:** If you add a parameter to `train.py`, you must also add it to `container_overrides`.
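The exact rule lives in `train.py`; here is a minimal sketch of a typical heuristic (the function name and the `max_classes` threshold are assumptions, not the project's actual code):

```python
import pandas as pd

def detect_problem_type(target: pd.Series, max_classes: int = 20) -> str:
    """Guess classification vs. regression from the target column.

    Hypothetical heuristic: non-numeric targets, or numeric targets
    with few distinct values, are treated as classification.
    """
    if not pd.api.types.is_numeric_dtype(target):
        return "classification"
    if target.nunique() <= max_classes:
        return "classification"
    return "regression"
```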
Time budget per dataset size
| Rows | Time budget |
|---|---|
| 50 K | 20 min |
Real‑time status
The frontend polls DynamoDB every 5 seconds to show training progress.
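The status record that polling reads back might be flattened like this before display; the attribute names (`status`, `progress`) are illustrative, not the project's actual table schema:

```python
def parse_job_status(item: dict) -> dict:
    """Flatten a DynamoDB-style item into the fields the UI displays.
    Attribute names here are hypothetical."""
    return {
        "status": item.get("status", {}).get("S", "UNKNOWN"),
        "progress": int(item.get("progress", {}).get("N", "0")),
    }
```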
Automatic reports
- EDA report – generated with data profiling libraries.
- Training report – model performance metrics and feature importance.
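As a rough illustration of what the EDA report surfaces per column (the project presumably relies on a full profiling library; this hand-rolled sketch only shows the kind of statistics involved):

```python
import pandas as pd

def eda_summary(df: pd.DataFrame) -> dict:
    """Per-column statistics of the kind a data-profiling report contains."""
    return {
        col: {
            "dtype": str(df[col].dtype),
            "missing": int(df[col].isna().sum()),
            "unique": int(df[col].nunique()),
        }
        for col in df.columns
    }
```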
IAM Policy (GitHub OIDC)
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "CoreServices",
      "Effect": "Allow",
      "Action": ["s3:*", "dynamodb:*", "lambda:*", "batch:*", "ecr:*"],
      "Resource": "arn:aws:*:*:*:automl-lite-*"
    },
    {
      "Sid": "APIGatewayAndAmplify",
      "Effect": "Allow",
      "Action": ["apigateway:*", "amplify:*"],
      "Resource": "*"
    },
    {
      "Sid": "IAMRoles",
      "Effect": "Allow",
      "Action": ["iam:*Role*", "iam:*RolePolicy*", "iam:PassRole"],
      "Resource": "arn:aws:iam::*:role/automl-lite-*"
    },
    {
      "Sid": "ServiceLinkedRoles",
      "Effect": "Allow",
      "Action": "iam:CreateServiceLinkedRole",
      "Resource": "arn:aws:iam::*:role/aws-service-role/*"
    },
    {
      "Sid": "Networking",
      "Effect": "Allow",
      "Action": ["ec2:Describe*", "ec2:*SecurityGroup*", "ec2:*Tags"],
      "Resource": "*"
    },
    {
      "Sid": "Logging",
      "Effect": "Allow",
      "Action": "logs:*",
      "Resource": "arn:aws:logs:*:*:log-group:/aws/*/automl-lite-*"
    }
  ]
}
```
CI/CD Flow
| Branch | Action |
|---|---|
| dev | Auto‑deploy to DEV |
| main | plan → manual approval → deploy to PROD |
Deployment times (approx.)
| Target | Time |
|---|---|
| Lambda only | ~2 min |
| Training container | ~3 min |
| Frontend | ~3 min |
| Full infrastructure | ~10 min |
Monthly Cost Breakdown
| Service | Cost |
|---|---|
| AWS Amplify | $5‑15 |
| Lambda + API Gateway | $1‑2 |
| Batch (Fargate Spot) | $2‑5 |
| S3 + DynamoDB | $1‑2 |
| Total | $10‑25 |
Comparison with SageMaker
| Feature | SageMaker Autopilot | AWS AutoML Lite |
|---|---|---|
| Monthly cost | ~$150+ (real‑time endpoint) | $10‑25 |
| Setup time | 30 + min (Studio) | ~15 min |
| Portable models | ❌ locked to SageMaker | ✅ downloadable .pkl |
| ML expertise | Medium | None |
| Auto problem detection | ✅ | ✅ |
| EDA reports | ❌ manual | ✅ automatic |
| IaC | ❌ console‑heavy | ✅ full Terraform |
| Cold start | N/A (always‑on) | ~200 ms (Lambda) |
| Best for | Production pipelines | Prototyping & side projects |
Using the Trained Model
```bash
# Build the prediction container
docker build -f scripts/Dockerfile.predict -t automl-predict .

# Show model info
docker run --rm -v ${PWD}:/data automl-predict /data/model.pkl --info

# Predict from CSV
docker run --rm -v ${PWD}:/data automl-predict \
  /data/model.pkl -i /data/test.csv -o /data/predictions.csv
```
Key observations
- The 265 MB ML dependency footprint forced the Lambda ↔ Batch split.
- Fargate Spot yields ~70 % savings; interruptions are rare for short jobs.
- FLAML provides a smaller footprint and faster training than AutoGluon with comparable results.
Future Work (Roadmap)
- ☐ ONNX export – deploy models to edge devices
- ☐ Model comparison UI – train multiple models side‑by‑side
- ☐ Real‑time updates via WebSocket (instead of polling)
- ☐ Multi‑user support with Cognito authentication
- ☐ Hyperparameter UI – fine‑tune FLAML settings from the frontend
- ☐ Email notifications on training completion
Contributing
Contributions are welcome! Check the GitHub Issues for good first issues.
- Repository: (link omitted)