Building a Cost-Effective AutoML Platform on AWS: $10-25/month vs $150+ for SageMaker Endpoints

Published: December 2, 2025 at 05:14 PM EST
3 min read
Source: Dev.to

TL;DR

I built a serverless AutoML platform that trains ML models for ≈ $10‑25 / month. Upload a CSV, select the target column, and get a trained model back—no ML expertise required.

Prerequisites

  • AWS account with admin access
  • AWS CLI v2 (configured with aws configure)
  • Terraform ≥ 1.5
  • Docker (running)
  • Node.js 18+ and pnpm (frontend)
  • Python 3.11+ (local development)

Deployment

Estimated time: ~15 minutes from clone to a working platform.

git clone https://github.com/cristofima/AWS-AutoML-Lite.git
cd AWS-AutoML-Lite/infrastructure/terraform
terraform init && terraform apply

Why a Custom Solution?

AWS SageMaker Autopilot is powerful but costly for prototyping.

  • Free tier: 50 h/month of training for the first 2 months.
  • Real‑time inference endpoints (e.g., ml.c5.xlarge) cost ~$150 / month when left running 24/7.

The goal was a cheaper, serverless alternative for side projects.

Goals

Goal | Desired outcome
Upload CSV → Trained model | .pkl file
Auto‑detect problem type | Classification vs. regression
Automatic EDA reports | Data profiling out‑of‑the‑box
Cost | Under $25 / month

Architecture Overview

Component | Technology | Reason
Backend API | FastAPI + Mangum | Async, auto‑docs, Lambda‑ready
Training | FLAML + scikit‑learn | Fast AutoML, production‑ready
Frontend | Next.js 16 + Tailwind | SSR support via Amplify
Infrastructure | Terraform | Reproducible, multi‑env
CI/CD | GitHub Actions + OIDC | No stored AWS credentials

Problem‑type detection (auto)

# classification if:

Note: If you add a parameter to `train.py`, you must also add it to `container_overrides`.
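The exact detection rule is not shown here, so the following is only a minimal sketch of the general idea. It assumes a hypothetical heuristic: a target with any non-numeric value, or with a small number of distinct integer-valued levels, is treated as classification, and everything else as regression. The function name and the `max_classes` cutoff are illustrative assumptions, not the project's code.

```python
from typing import Iterable


def _is_number(value: object) -> bool:
    """Return True if the value parses as a float (bools are excluded)."""
    if isinstance(value, bool):
        return False
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def detect_problem_type(target: Iterable[object], max_classes: int = 20) -> str:
    """Guess 'classification' or 'regression' from the target column.

    Hypothetical heuristic (not the project's actual rule):
      * any non-numeric value                               -> classification
      * numeric with <= max_classes distinct integer levels -> classification
      * otherwise                                           -> regression
    """
    # Ignore missing cells so they do not count as a class.
    values = [v for v in target if v is not None and v != ""]
    if not values:
        raise ValueError("target column is empty")
    if not all(_is_number(v) for v in values):
        return "classification"
    distinct = {float(v) for v in values}
    all_integral = all(n.is_integer() for n in distinct)
    if all_integral and len(distinct) <= max_classes:
        return "classification"
    return "regression"
```

With this heuristic a 0/1 column or a column of labels such as "yes"/"no" maps to classification, while a column of prices maps to regression. An ID-like integer column with many distinct values also falls through to regression, which is one reason a real implementation may need extra checks.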

Time budget per dataset size

Rows | Time budget
50 K | 20 min
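A tiered budget like this can be expressed as a small lookup. The sketch below is illustrative only: the 50 K → 20 min entry is the one value taken from the table, reading it as "50 000 rows or more" is an assumption, and the 5-minute floor for smaller datasets is a placeholder rather than a figure from the project.

```python
from bisect import bisect_right

# (minimum row count, minutes), sorted by row count and starting at 0.
# Only the 50 K -> 20 min entry comes from the article; the 5-minute floor
# is a placeholder, and treating 50 K as a lower bound is an assumption.
EXAMPLE_TIERS = [(0, 5), (50_000, 20)]


def time_budget_seconds(n_rows: int, tiers=EXAMPLE_TIERS) -> int:
    """Return the training time budget in seconds for a dataset of n_rows rows."""
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    thresholds = [min_rows for min_rows, _ in tiers]
    # bisect_right counts the thresholds <= n_rows; the last of those applies.
    index = bisect_right(thresholds, n_rows) - 1
    return tiers[index][1] * 60
```

The result is in seconds because FLAML's `AutoML.fit` takes its `time_budget` argument in seconds, so the value could be passed through unchanged.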

Real‑time status

The frontend polls DynamoDB every 5 seconds to show training progress.
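The polling pattern itself is simple enough to sketch. The real frontend is a Next.js app, so this Python version only illustrates the loop. `fetch_status` is a hypothetical callable standing in for whatever reads the job record, and the status names are assumptions.

```python
import time
from typing import Callable

# Assumed status names; the article does not list the values stored in DynamoDB.
TERMINAL_STATES = frozenset({"COMPLETED", "FAILED"})


def poll_job(
    fetch_status: Callable[[], str],
    interval: float = 5.0,
    max_polls: int = 720,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status every `interval` seconds until a terminal state appears."""
    for attempt in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if attempt < max_polls - 1:  # no need to wait after the final attempt
            sleep(interval)
    raise TimeoutError(f"job still running after {max_polls} polls")
```

Injecting `sleep` keeps the loop testable without real delays, and `max_polls` (one hour at the default interval) prevents a stuck job from being polled forever.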

Automatic reports

  • EDA report – generated with data profiling libraries.
  • Training report – model performance metrics and feature importance.
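The article does not name the profiling library, so the following is a standard-library stand-in that shows the kind of per-column summary an EDA report starts from. The function name and the chosen statistics are assumptions for illustration, not the project's implementation.

```python
import csv
import io
from statistics import mean


def profile_csv(text: str) -> dict:
    """Return row count, missing count, distinct count, and mean per column.

    The mean is None for columns that contain any non-numeric value.
    """
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            # Short rows yield None for missing cells; treat them as empty.
            columns[name].append((row.get(name) or "").strip())

    report = {}
    for name, values in columns.items():
        present = [v for v in values if v != ""]
        numeric = []
        for v in present:
            try:
                numeric.append(float(v))
            except ValueError:
                numeric = None  # at least one value is not a number
                break
        report[name] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "mean": mean(numeric) if numeric else None,
        }
    return report
```

A dedicated profiling library adds distributions, correlations, and warnings on top of this, but missing and distinct counts are usually the first things worth checking before training.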

IAM Policy (GitHub OIDC)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "CoreServices",
      "Effect": "Allow",
      "Action": ["s3:*", "dynamodb:*", "lambda:*", "batch:*", "ecr:*"],
      "Resource": "arn:aws:*:*:*:automl-lite-*"
    },
    {
      "Sid": "APIGatewayAndAmplify",
      "Effect": "Allow",
      "Action": ["apigateway:*", "amplify:*"],
      "Resource": "*"
    },
    {
      "Sid": "IAMRoles",
      "Effect": "Allow",
      "Action": ["iam:*Role*", "iam:*RolePolicy*", "iam:PassRole"],
      "Resource": "arn:aws:iam::*:role/automl-lite-*"
    },
    {
      "Sid": "ServiceLinkedRoles",
      "Effect": "Allow",
      "Action": "iam:CreateServiceLinkedRole",
      "Resource": "arn:aws:iam::*:role/aws-service-role/*"
    },
    {
      "Sid": "Networking",
      "Effect": "Allow",
      "Action": ["ec2:Describe*", "ec2:*SecurityGroup*", "ec2:*Tags"],
      "Resource": "*"
    },
    {
      "Sid": "Logging",
      "Effect": "Allow",
      "Action": "logs:*",
      "Resource": "arn:aws:logs:*:*:log-group:/aws/*/automl-lite-*"
    }
  ]
}

CI/CD Flow

Branch | Action
dev | Auto‑deploy to DEV
main | plan → manual approval → deploy to PROD

Deployment times (approx.)

Target | Time
Lambda only | ~2 min
Training container | ~3 min
Frontend | ~3 min
Full infrastructure | ~10 min

Monthly Cost Breakdown

Service | Cost
AWS Amplify | $5‑15
Lambda + API Gateway | $1‑2
Batch (Fargate Spot) | $2‑5
S3 + DynamoDB | $1‑2
Total | $10‑25
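As a quick arithmetic check, the line items can be summed directly. The ranges below are copied from the breakdown; the itemized sum comes to $9-24, which the stated total rounds to $10-25.

```python
# Monthly cost ranges in USD, copied from the breakdown: (low, high).
COSTS = {
    "AWS Amplify": (5, 15),
    "Lambda + API Gateway": (1, 2),
    "Batch (Fargate Spot)": (2, 5),
    "S3 + DynamoDB": (1, 2),
}

low = sum(lo for lo, _ in COSTS.values())
high = sum(hi for _, hi in COSTS.values())

# prints: Itemized total: $9-24 / month
print(f"Itemized total: ${low}-{high} / month")
```

Amplify accounts for more than half of either bound, so frontend hosting is the first place to look when trimming the bill.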

Comparison with SageMaker

Feature | SageMaker Autopilot | AWS AutoML Lite
Monthly cost | ~$150+ (real‑time endpoint) | $10‑25
Setup time | 30+ min (Studio) | ~15 min
Portable models | ❌ locked to SageMaker | ✅ downloadable .pkl
ML expertise | Medium | None
Auto problem detection | |
EDA reports | ❌ manual | ✅ automatic
IaC | ❌ console‑heavy | ✅ full Terraform
Cold start | N/A (always‑on) | ~200 ms (Lambda)
Best for | Production pipelines | Prototyping & side projects

Using the Trained Model

# Build prediction container
docker build -f scripts/Dockerfile.predict -t automl-predict .

# Show model info
docker run --rm -v ${PWD}:/data automl-predict /data/model.pkl --info

# Predict from CSV
docker run --rm -v ${PWD}:/data automl-predict \
  /data/model.pkl -i /data/test.csv -o /data/predictions.csv
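Outside Docker, the same file can in principle be loaded from Python. This is a sketch under two assumptions the article does not confirm: that model.pkl is a standard pickle, and that the libraries used in training (for example FLAML and scikit-learn) are installed locally in compatible versions. The helper names are hypothetical.

```python
import pickle
from pathlib import Path


def load_model(path: str):
    """Load a pickled model. Only unpickle files from a source you trust."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


def predict(model, rows):
    """Call the model's predict method, failing clearly if it has none."""
    if not hasattr(model, "predict"):
        raise TypeError(f"{type(model).__name__} has no predict() method")
    return model.predict(rows)
```

Typical use would be `model = load_model("model.pkl")` followed by `predict(model, features)`, where `features` has the same columns as the training CSV minus the target. Unpickling executes arbitrary code, so the container route is the safer choice for files of unknown origin.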

Key observations

  • The 265 MB ML dependency footprint exceeds Lambda's 250 MB unzipped deployment package limit, which forced the Lambda ↔ Batch split.
  • Fargate Spot yields ~70 % savings; interruptions are rare for short jobs.
  • FLAML provides a smaller footprint and faster training than AutoGluon with comparable results.

Future Work (Roadmap)

  • ☐ ONNX export – deploy models to edge devices
  • ☐ Model comparison UI – train multiple models side‑by‑side
  • ☐ Real‑time updates via WebSocket (instead of polling)
  • ☐ Multi‑user support with Cognito authentication
  • ☐ Hyperparameter UI – fine‑tune FLAML settings from the frontend
  • ☐ Email notifications on training completion

Contributing

Contributions are welcome! Check the GitHub Issues for good first issues.

  • Repository: (link omitted)