Building a Cost-Effective AutoML Platform on AWS: $10-25/month vs $150+ for SageMaker Endpoints
Source: Dev.to
TL;DR
I built a serverless AutoML platform that trains ML models for ≈ $10‑25 / month. Upload a CSV, select the target column, and get a trained model back—no ML expertise required.
Prerequisites
- AWS account with admin access
- AWS CLI v2 (configured with `aws configure`)
- Terraform ≥ 1.5
- Docker (running)
- Node.js 18+ and pnpm (frontend)
- Python 3.11+ (local development)
Deployment
Estimated time: ~15 minutes from clone to a working platform.
```bash
git clone https://github.com/cristofima/AWS-AutoML-Lite.git
cd AWS-AutoML-Lite/infrastructure/terraform
terraform init && terraform apply
```
Why a Custom Solution?
AWS SageMaker Autopilot is powerful but costly for prototyping.
- Free tier: 50 h/month of training for the first 2 months.
- Real‑time inference endpoints (e.g., `ml.c5.xlarge`) cost ~$150 / month when kept running 24/7.
The goal was a cheaper, serverless alternative for side projects.
Goals
| Goal | Desired outcome |
|---|---|
| Upload CSV → Trained model | .pkl file |
| Auto‑detect problem type | Classification vs. regression |
| Automatic EDA reports | Data profiling out‑of‑the‑box |
| Cost | Under $25 / month |
Architecture Overview
| Component | Technology | Reason |
|---|---|---|
| Backend API | FastAPI + Mangum | Async, auto‑docs, Lambda‑ready |
| Training | FLAML + scikit‑learn | Fast AutoML, production‑ready |
| Frontend | Next.js 16 + Tailwind | SSR support via Amplify |
| Infrastructure | Terraform | Reproducible, multi‑env |
| CI/CD | GitHub Actions + OIDC | No stored AWS credentials |
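Mangum adapts the FastAPI app to Lambda's event/response contract. Conceptually, the handler that Lambda invokes has the shape below (a stdlib-only sketch, not the project's actual code; the real backend simply does `handler = Mangum(app)`):

```python
import json

def handler(event, context):
    """Shape of a Lambda proxy handler. In the real backend this object
    is produced by Mangum wrapping the FastAPI app."""
    body = json.loads(event.get("body") or "{}")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"received": body}),
    }
```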
Problem‑type detection (auto)
The training job inspects the target column and chooses classification or regression automatically.

**Note:** If you add a parameter to `train.py`, you must also add it to `container_overrides`.
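The exact rule lives in `train.py`; here is a minimal sketch of a typical heuristic (the function name and the `max_classes` threshold are assumptions, not the project's actual code):

```python
import pandas as pd

def detect_problem_type(target: pd.Series, max_classes: int = 20) -> str:
    """Guess classification vs. regression from the target column.

    Hypothetical heuristic: non-numeric targets, or numeric targets
    with few distinct values, are treated as classification.
    """
    if not pd.api.types.is_numeric_dtype(target):
        return "classification"
    if target.nunique() <= max_classes:
        return "classification"
    return "regression"
```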
Time budget per dataset size
| Rows | Time budget |
|---|---|
| 50 K | 20 min |
Real‑time status
The frontend polls DynamoDB every 5 seconds to show training progress.
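The status record that polling reads back might be flattened like this before display; the attribute names (`status`, `progress`) are illustrative, not the project's actual table schema:

```python
def parse_job_status(item: dict) -> dict:
    """Flatten a DynamoDB-style item into the fields the UI displays.
    Attribute names here are hypothetical."""
    return {
        "status": item.get("status", {}).get("S", "UNKNOWN"),
        "progress": int(item.get("progress", {}).get("N", "0")),
    }
```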
Automatic reports
- EDA report – generated with data profiling libraries.
- Training report – model performance metrics and feature importance.
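As a rough illustration of what the EDA report surfaces per column (the project presumably relies on a full profiling library; this hand-rolled sketch only shows the kind of statistics involved):

```python
import pandas as pd

def eda_summary(df: pd.DataFrame) -> dict:
    """Per-column statistics of the kind a data-profiling report contains."""
    return {
        col: {
            "dtype": str(df[col].dtype),
            "missing": int(df[col].isna().sum()),
            "unique": int(df[col].nunique()),
        }
        for col in df.columns
    }
```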
IAM Policy (GitHub OIDC)
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "CoreServices",
      "Effect": "Allow",
      "Action": ["s3:*", "dynamodb:*", "lambda:*", "batch:*", "ecr:*"],
      "Resource": "arn:aws:*:*:*:automl-lite-*"
    },
    {
      "Sid": "APIGatewayAndAmplify",
      "Effect": "Allow",
      "Action": ["apigateway:*", "amplify:*"],
      "Resource": "*"
    },
    {
      "Sid": "IAMRoles",
      "Effect": "Allow",
      "Action": ["iam:*Role*", "iam:*RolePolicy*", "iam:PassRole"],
      "Resource": "arn:aws:iam::*:role/automl-lite-*"
    },
    {
      "Sid": "ServiceLinkedRoles",
      "Effect": "Allow",
      "Action": "iam:CreateServiceLinkedRole",
      "Resource": "arn:aws:iam::*:role/aws-service-role/*"
    },
    {
      "Sid": "Networking",
      "Effect": "Allow",
      "Action": ["ec2:Describe*", "ec2:*SecurityGroup*", "ec2:*Tags"],
      "Resource": "*"
    },
    {
      "Sid": "Logging",
      "Effect": "Allow",
      "Action": "logs:*",
      "Resource": "arn:aws:logs:*:*:log-group:/aws/*/automl-lite-*"
    }
  ]
}
```
CI/CD Flow
| Branch | Action |
|---|---|
| dev | Auto‑deploy to DEV |
| main | plan → manual approval → deploy to PROD |
Deployment times (approx.)
| Target | Time |
|---|---|
| Lambda only | ~2 min |
| Training container | ~3 min |
| Frontend | ~3 min |
| Full infrastructure | ~10 min |
Monthly Cost Breakdown
| Service | Cost |
|---|---|
| AWS Amplify | $5‑15 |
| Lambda + API Gateway | $1‑2 |
| Batch (Fargate Spot) | $2‑5 |
| S3 + DynamoDB | $1‑2 |
| Total | $10‑25 |
Comparison with SageMaker
| Feature | SageMaker Autopilot | AWS AutoML Lite |
|---|---|---|
| Monthly cost | ~$150+ (real‑time endpoint) | $10‑25 |
| Setup time | 30 + min (Studio) | ~15 min |
| Portable models | ❌ locked to SageMaker | ✅ downloadable .pkl |
| ML expertise | Medium | None |
| Auto problem detection | ✅ | ✅ |
| EDA reports | ❌ manual | ✅ automatic |
| IaC | ❌ console‑heavy | ✅ full Terraform |
| Cold start | N/A (always‑on) | ~200 ms (Lambda) |
| Best for | Production pipelines | Prototyping & side projects |
Using the Trained Model
```bash
# Build the prediction container
docker build -f scripts/Dockerfile.predict -t automl-predict .

# Show model info
docker run --rm -v ${PWD}:/data automl-predict /data/model.pkl --info

# Predict from CSV
docker run --rm -v ${PWD}:/data automl-predict \
  /data/model.pkl -i /data/test.csv -o /data/predictions.csv
```
Key observations
- The 265 MB ML dependency footprint forced the Lambda ↔ Batch split.
- Fargate Spot yields ~70 % savings; interruptions are rare for short jobs.
- FLAML provides a smaller footprint and faster training than AutoGluon with comparable results.
Future Work (Roadmap)
- ☐ ONNX export – deploy models to edge devices
- ☐ Model comparison UI – train multiple models side‑by‑side
- ☐ Real‑time updates via WebSocket (instead of polling)
- ☐ Multi‑user support with Cognito authentication
- ☐ Hyperparameter UI – fine‑tune FLAML settings from the frontend
- ☐ Email notifications on training completion
Contributing
Contributions are welcome! Check the GitHub Issues for good first issues.
- Repository: (link omitted)