Building Production AI: A Three-Part MLOps Journey - Pt.2
Source: Dev.to
The Training Lab: Google Colab Setup
First things first: we need a place to work. Training AI is like running a marathon for a computer—it’s exhausting. We use Google Colab because it gives us a free T4 GPU, the “engine” we need to train our Adire model.
Install the required libraries
# ========================================
# Cell 1: Environment Setup
# ========================================
!pip install -q diffusers==0.25.0 transformers==4.36.0 \
accelerate==0.25.0 peft==0.7.1 bitsandbytes
# Verify that the GPU is available
import torch
assert torch.cuda.is_available(), "No GPU found! Check your Colab settings."
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB")
Download the training script
# ========================================
# Cell 2: Download the 'Recipe'
# ========================================
# Grab a proven script from the Hugging Face team.
# Pin the tag to the same version we installed (diffusers==0.25.0).
!wget -q https://raw.githubusercontent.com/huggingface/diffusers/v0.25.0/examples/dreambooth/train_dreambooth_lora.py
Configure the training run
# ========================================
# Cell 3: The Secret Sauce (Configuration)
# ========================================
CONFIG = {
    "model": "runwayml/stable-diffusion-v1-5",
    "output_dir": "./lora_weights",
    "instance_data_dir": "./training_images",
    "instance_prompt": "a photo in nigerian_adire_style",
    "resolution": 512,
    "train_batch_size": 1,
    "gradient_accumulation_steps": 4,  # “save up” steps to act like a bigger batch
    "learning_rate": 1e-4,
    "lr_scheduler": "constant",
    "max_train_steps": 800,  # 800 iterations is usually the sweet spot
    "lora_rank": 4,
    "lora_alpha": 4,
    "seed": 42,
}
Launch the training
# ========================================
# Cell 4: Ignition!
# ========================================
# Note: --rank wires CONFIG["lora_rank"] through to the script's LoRA dimension.
!accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path="{CONFIG['model']}" \
  --instance_data_dir="{CONFIG['instance_data_dir']}" \
  --output_dir="{CONFIG['output_dir']}" \
  --instance_prompt="{CONFIG['instance_prompt']}" \
  --resolution={CONFIG['resolution']} \
  --train_batch_size={CONFIG['train_batch_size']} \
  --gradient_accumulation_steps={CONFIG['gradient_accumulation_steps']} \
  --learning_rate={CONFIG['learning_rate']} \
  --lr_scheduler="{CONFIG['lr_scheduler']}" \
  --max_train_steps={CONFIG['max_train_steps']} \
  --rank={CONFIG['lora_rank']} \
  --use_8bit_adam \
  --checkpointing_steps=100 \
  --validation_prompt="{CONFIG['instance_prompt']} sunset over Lagos" \
  --seed={CONFIG['seed']}
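Once the run finishes, it's worth a quick smoke test before building any pipeline around it. A minimal sketch, assuming the script saved its LoRA weights into CONFIG["output_dir"]:

# Cell 5 (optional): quick smoke test of the freshly trained weights.
# Assumes training wrote its LoRA weights into CONFIG["output_dir"].
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    CONFIG["model"], torch_dtype=torch.float16
).to("cuda")
pipe.unet.load_attn_procs(CONFIG["output_dir"])  # attach the trained adapter

# The prompt suffix here is just an example.
image = pipe(f"{CONFIG['instance_prompt']}, indigo dye, geometric motifs").images[0]
image.save("smoke_test.png")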
Tuning the Engine: Hyperparameter Analysis
You might wonder why I chose those specific numbers in CONFIG. AI training is a bit like cooking—a pinch too much salt ruins the soup.
| Hyperparameter | Reasoning |
|---|---|
| Learning Rate (1e-4) | Too high → the model “panics” and learns nothing. Too low → training drags on for days. |
| Effective Batch Size (1 × 4) | We train on one image at a time but accumulate gradients over four steps, keeping training stable without blowing GPU memory. |
| LoRA Rank (4) | Lean and fast. A rank of 16 would make the file ~4× larger with negligible quality gain. Efficiency is the goal. |
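In other words, the numbers the optimizer actually sees:

# Effective batch size = per-step batch size × accumulation steps
effective_batch = CONFIG["train_batch_size"] * CONFIG["gradient_accumulation_steps"]
print(effective_batch)  # 1 × 4 = 4 images contribute to each optimizer update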
The Factory: Building the MLOps Pipeline
Now we step away from the notebook and build a real software system. In production you don’t want to manually copy‑paste files, so we use ZenML to create a conveyor belt.
Our pipeline has three main “employees”:
- Evaluator – Checks whether the model actually creates Adire patterns or just noise.
- Promoter – The “manager” that looks at test scores and decides if the model is good enough for customers.
- Deployer – Packs the model up and ships it to the cloud.
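Before wiring these up, ZenML itself needs a one-time setup. A minimal local-stack sketch (assumes a fresh environment; MLflow is included because our steps log metrics to it):

pip install zenml mlflow
zenml init  # initialize a ZenML repository in the project directory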
Step 1: The Evaluator (Quality Control)
This step loads the newly trained model, generates a few pictures, measures latency, and evaluates how well the images match the prompts. All stats are logged to MLflow for permanent record‑keeping.
import time
from typing import Dict, List

import mlflow
import torch
from diffusers import StableDiffusionPipeline
from zenml import step


@step(enable_cache=False)
def evaluate_model(model_path: str, test_prompts: List[str]) -> Dict[str, float]:
    """
    Load the Stable Diffusion model + LoRA weights, generate images,
    time the generation, and compute a quality score using CLIP.
    """
    # Load the base model and attach the trained LoRA adapters
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
        safety_checker=None,
    ).to("cuda")
    pipe.unet.load_attn_procs(model_path)

    results = {}
    for i, prompt in enumerate(test_prompts):
        # Measure generation time
        start = time.time()
        image = pipe(prompt).images[0]
        gen_time = time.time() - start

        # Compute a CLIP-based similarity score (higher = better)
        quality = compute_clip_score(image, prompt)

        results[f"prompt_{i}_time"] = gen_time
        results[f"prompt_{i}_quality"] = quality

    # Log everything to MLflow for permanent record-keeping
    mlflow.log_metrics(results)
    return results
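The compute_clip_score helper used above isn't shown in the original; here is one possible implementation with the CLIP model from transformers. This is a sketch under assumptions (the model checkpoint and the /100 scaling are my choices, not part of the pipeline), and note that raw CLIP similarities live on a different scale than the 0.75 threshold below, so calibrate the scorer and the gate together.

import torch
from transformers import CLIPModel, CLIPProcessor

# Assumed helper: image-text similarity via openai/clip-vit-base-patch32.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def compute_clip_score(image, prompt: str) -> float:
    inputs = clip_processor(
        text=[prompt], images=image, return_tensors="pt", padding=True
    )
    with torch.no_grad():
        outputs = clip_model(**inputs)
    # logits_per_image = logit_scale (~100) times the cosine similarity;
    # divide by 100 to recover a rough similarity value
    return outputs.logits_per_image.item() / 100.0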
Step 2: The Promoter (Gatekeeper)
This is our automated quality gate, with strict rules: if the average quality falls below 0.75, or a picture takes longer than 30 seconds to generate, the model is “fired.” If it passes both checks, it gets promoted to “Production” status.
@step
def promote_model(
    metrics: Dict[str, float],
    threshold: float = 0.75,
    speed_threshold: float = 30.0,
) -> bool:
    """
    Decide whether the model passes the quality gates.
    Returns True only if average quality exceeds `threshold`
    AND average generation time stays under `speed_threshold` seconds.
    """
    quality_keys = [k for k in metrics if "quality" in k]
    time_keys = [k for k in metrics if "time" in k]
    avg_quality = sum(metrics[k] for k in quality_keys) / len(quality_keys)
    avg_time = sum(metrics[k] for k in time_keys) / len(time_keys)
    mlflow.log_metrics({"avg_quality": avg_quality, "avg_time": avg_time})

    # Does it meet our standards?
    checks = {
        "quality_check": avg_quality >= threshold,
        "speed_check": avg_time <= speed_threshold,
    }
    return all(checks.values())
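For instance, a model averaging 0.795 quality and 13.2 s per image clears both gates (values are illustrative):

sample_metrics = {
    "prompt_0_quality": 0.81, "prompt_0_time": 12.3,
    "prompt_1_quality": 0.78, "prompt_1_time": 14.1,
}
# avg_quality = 0.795 >= 0.75 and avg_time = 13.2 <= 30.0 → promoted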
Step 3: The Deployer (Shipping)
import logging

from huggingface_hub import create_repo, upload_folder

logger = logging.getLogger(__name__)


@step
def deploy_model(model_path: str, approved: bool) -> None:
    """
    If `approved` is True, push the model to a model registry
    (e.g., Hugging Face Hub, S3, or a custom endpoint).
    """
    if not approved:
        logger.info("Model did not meet quality thresholds – not deploying.")
        return

    # Example: push to the Hugging Face Hub
    repo_id = "my-org/adire-lora"
    create_repo(repo_id, exist_ok=True)
    upload_folder(
        repo_id=repo_id,
        folder_path=model_path,
        commit_message="Deploy new Adire LoRA weights",
    )
    logger.info(f"Model deployed to https://huggingface.co/{repo_id}")
Full ZenML Pipeline
from typing import List

from zenml import pipeline


@pipeline
def adire_training_pipeline(
    model_path: str,
    test_prompts: List[str],
    quality_threshold: float = 0.75,
):
    # Wire the three "employees" together: evaluate → gate → ship
    metrics = evaluate_model(model_path=model_path, test_prompts=test_prompts)
    approved = promote_model(metrics=metrics, threshold=quality_threshold)
    deploy_model(model_path=model_path, approved=approved)
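Kicking everything off is a single call (the prompts here are illustrative):

adire_training_pipeline(
    model_path="./lora_weights",
    test_prompts=[
        "a photo in nigerian_adire_style, market scene",
        "a photo in nigerian_adire_style, sunset over Lagos",
    ],
    quality_threshold=0.75,
)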
End to end, the workflow:
- Trains the LoRA on Colab.
- Evaluates the resulting model on a held‑out prompt set.
- Promotes it only if it clears the quality and speed gates.
- Deploys the approved model to a remote registry for downstream consumption.
TL;DR
- Colab → free GPU for iterative LoRA training.
- CONFIG → carefully chosen hyper‑parameters for speed & quality.
- ZenML pipeline → automated evaluation, gating, and deployment.
With this setup you have a repeatable, production‑ready “factory” that turns raw Adire images into a continuously improving generative model. 🚀