Zero-to-Scale ML: Deploying ONNX Models on Kubernetes with FastAPI and HPA
Source: Dev.to

The path to scalable ML deployment requires a high‑performance API layer and robust orchestration. This post walks through setting up a local, highly available, auto‑scaling inference service, using FastAPI for speed and Kind to run the local Kubernetes cluster that does the orchestration.
Phase 1: The FastAPI Inference Service
Our Python service handles ONNX model inference. The critical component for K8s stability is the /health endpoint:
# app.py snippet
from fastapi import FastAPI

app = FastAPI()

# ... model loading logic ...

@app.get("/health")
def health_check():
    # K8s probes will hit this endpoint frequently, so keep it cheap
    return {"status": "ok", "model_loaded": True}

# ... /predict endpoint ...
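The /predict handler is elided above; the sketch below is one plausible way to wire it up with onnxruntime. Treat it as a sketch only: the model path, input layout, and request schema are placeholders, not details from the original guide.

# /predict sketch (model path, shapes, and request schema are assumptions)
import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
session = ort.InferenceSession("model.onnx")   # placeholder model path
input_name = session.get_inputs()[0].name

class PredictRequest(BaseModel):
    pixels: list[float]   # flattened image values; placeholder request schema

@app.post("/predict")
def predict(req: PredictRequest):
    # Reshape into a single-item batch and run one inference pass
    x = np.array(req.pixels, dtype=np.float32).reshape(1, -1)
    logits = session.run(None, {input_name: x})[0]
    return {"class_id": int(np.argmax(logits))}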
Phase 2: Docker and Kubernetes Deployment
After building the image (clothing-classifier:latest) and loading it into the Kind cluster, we define the Deployment. Note the resource constraints and probes: the CPU request is what the HPA measures utilization against, and the readiness probe keeps traffic away from pods whose ONNX model has not finished loading.
# deployment.yaml (Snippet focusing on probes and resources)
resources:
  requests:
    cpu: "250m"        # For scheduling
    memory: "500Mi"
  limits:
    cpu: "500m"        # To prevent monopolizing the node
    memory: "1Gi"
livenessProbe:
  httpGet: {path: /health, port: 8000}
  initialDelaySeconds: 5
readinessProbe:
  httpGet: {path: /health, port: 8000}
  initialDelaySeconds: 5   # Gives time for the ONNX model to load
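For orientation, here is roughly where those fields sit in the full manifest. This is a structural sketch, not the guide's file: the labels and imagePullPolicy are assumptions (a non-Always pull policy is generally needed for a :latest image loaded into Kind, since there is no registry to pull it from).

# deployment.yaml (structural sketch; labels and pull policy are assumptions)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: clothing-classifier-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: clothing-classifier
  template:
    metadata:
      labels:
        app: clothing-classifier
    spec:
      containers:
        - name: clothing-classifier
          image: clothing-classifier:latest
          imagePullPolicy: IfNotPresent   # assumption: use the image loaded into Kind instead of pulling
          ports:
            - containerPort: 8000
          # the resources, livenessProbe, and readinessProbe from the snippet above nest here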
Phase 3: Implementing Horizontal Pod Autoscaler (HPA)
Scalability is handled by the HPA, which needs the Metrics Server running in the cluster to report CPU usage (Kind does not install it by default).
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: clothing-classifier-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: clothing-classifier-deployment
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # Scale up if CPU exceeds 50%
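A note on the math: averageUtilization is measured against the CPU request (250m here), and the controller sizes the Deployment roughly as desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). For example, if the 2 starting pods are averaging 100% of their request, the HPA asks for ceil(2 × 100 / 50) = 4 replicas, capped at maxReplicas.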
Result: Under load, the HPA scales the Deployment out toward 5 replicas and back down to 2 once traffic subsides (a rough load-generation sketch follows below). This is the definition of elastic, cost‑effective MLOps.
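One rough way to generate that load locally is a simple thread-pool client like the sketch below. It assumes the service is reachable at localhost:8000 (for example via kubectl port-forward); hitting /predict with a real payload will exercise the CPU far more than /health does.

# load_test.py sketch (URL, thread count, and request count are assumptions)
import concurrent.futures
import urllib.request

URL = "http://localhost:8000/health"   # swap in /predict with a real payload for heavier CPU load

def hit(_):
    # Fire one request and return its HTTP status code
    with urllib.request.urlopen(URL, timeout=5) as resp:
        return resp.status

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    statuses = list(pool.map(hit, range(5000)))

print(f"completed {len(statuses)} requests")

While it runs, kubectl get hpa -w shows the reported utilization climbing and the replica count stepping up from 2 toward 5.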
Read the full guide here.