Zero-to-Scale ML: Deploying ONNX Models on Kubernetes with FastAPI and HPA
Source: Dev.to

The path to scalable ML deployment requires a high‑performance API layer and robust orchestration. This post walks through setting up a local, highly available, auto‑scaling inference service, using FastAPI for speed and Kind to run the local Kubernetes cluster that does the orchestration.
Phase 1: The FastAPI Inference Service
Our Python service handles ONNX model inference. The critical component for K8s stability is the /health endpoint:
# app.py snippet
from fastapi import FastAPI

app = FastAPI()

# ... model loading logic ...

@app.get("/health")
def health_check():
    # K8s probes will hit this endpoint frequently, so keep it cheap
    return {"status": "ok", "model_loaded": True}

# ... /predict endpoint ...
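The /predict handler is elided above; the sketch below is one plausible way to wire it up with onnxruntime. Treat it as a sketch only: the model path, input layout, and request schema are placeholders, not details from the original guide.

# /predict sketch (model path, shapes, and request schema are assumptions)
import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
session = ort.InferenceSession("model.onnx")   # placeholder model path
input_name = session.get_inputs()[0].name

class PredictRequest(BaseModel):
    pixels: list[float]   # flattened image values; placeholder request schema

@app.post("/predict")
def predict(req: PredictRequest):
    # Reshape into a single-item batch and run one inference pass
    x = np.array(req.pixels, dtype=np.float32).reshape(1, -1)
    logits = session.run(None, {input_name: x})[0]
    return {"class_id": int(np.argmax(logits))}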
Phase 2: Docker and Kubernetes Deployment
After building the image (clothing-classifier:latest) and loading it into the Kind cluster, we define the Deployment. Note the resource constraints and probes: the CPU request is what the HPA measures utilization against, and the readiness probe keeps traffic away from pods whose ONNX model has not finished loading.
# deployment.yaml (Snippet focusing on probes and resources)
resources:
  requests:
    cpu: "250m"        # For scheduling
    memory: "500Mi"
  limits:
    cpu: "500m"        # To prevent monopolizing the node
    memory: "1Gi"
livenessProbe:
  httpGet: {path: /health, port: 8000}
  initialDelaySeconds: 5
readinessProbe:
  httpGet: {path: /health, port: 8000}
  initialDelaySeconds: 5   # Gives time for the ONNX model to load
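For orientation, here is roughly where those fields sit in the full manifest. This is a structural sketch, not the guide's file: the labels and imagePullPolicy are assumptions (a non-Always pull policy is generally needed for a :latest image loaded into Kind, since there is no registry to pull it from).

# deployment.yaml (structural sketch; labels and pull policy are assumptions)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: clothing-classifier-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: clothing-classifier
  template:
    metadata:
      labels:
        app: clothing-classifier
    spec:
      containers:
        - name: clothing-classifier
          image: clothing-classifier:latest
          imagePullPolicy: IfNotPresent   # assumption: use the image loaded into Kind instead of pulling
          ports:
            - containerPort: 8000
          # the resources, livenessProbe, and readinessProbe from the snippet above nest here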
Phase 3: Implementing Horizontal Pod Autoscaler (HPA)
Scalability is handled by the HPA, which needs the Metrics Server running in the cluster to report CPU usage (Kind does not install it by default).
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: clothing-classifier-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: clothing-classifier-deployment
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # Scale up if CPU exceeds 50%
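A note on the math: averageUtilization is measured against the CPU request (250m here), and the controller sizes the Deployment roughly as desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). For example, if the 2 starting pods are averaging 100% of their request, the HPA asks for ceil(2 × 100 / 50) = 4 replicas, capped at maxReplicas.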
Result: Under load, the HPA scales the Deployment out toward 5 replicas and back down to 2 once traffic subsides (a rough load-generation sketch follows below). This is the definition of elastic, cost‑effective MLOps.
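One rough way to generate that load locally is a simple thread-pool client like the sketch below. It assumes the service is reachable at localhost:8000 (for example via kubectl port-forward); hitting /predict with a real payload will exercise the CPU far more than /health does.

# load_test.py sketch (URL, thread count, and request count are assumptions)
import concurrent.futures
import urllib.request

URL = "http://localhost:8000/health"   # swap in /predict with a real payload for heavier CPU load

def hit(_):
    # Fire one request and return its HTTP status code
    with urllib.request.urlopen(URL, timeout=5) as resp:
        return resp.status

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    statuses = list(pool.map(hit, range(5000)))

print(f"completed {len(statuses)} requests")

While it runs, kubectl get hpa -w shows the reported utilization climbing and the replica count stepping up from 2 toward 5.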
Read the full guide here.