Building Production AI: A Three-Part MLOps Journey - Pt.2

Published: January 18, 2026 at 11:57 AM EST
5 min read
Source: Dev.to

The Training Lab: Google Colab Setup

First things first: we need a place to work. Training AI is like running a marathon for a computer—it’s exhausting. We use Google Colab because it gives us a free T4 GPU, the “engine” we need to train our Adire model.

Install the required libraries

# ========================================
# Cell 1: Environment Setup
# ========================================
!pip install -q diffusers==0.25.0 transformers==4.36.0 \
             accelerate==0.25.0 peft==0.7.1 bitsandbytes

# Verify that the GPU is available
import torch
assert torch.cuda.is_available(), "No GPU found! Check your Colab settings."
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB")

Download the training script

# ========================================
# Cell 2: Download the 'Recipe'
# ========================================
# Grab a proven training script from the Hugging Face team.
# Pin the tag to the same version as the installed diffusers package.
!wget -q https://raw.githubusercontent.com/huggingface/diffusers/v0.25.0/examples/dreambooth/train_dreambooth_lora.py

Configure the training run

# ========================================
# Cell 3: The Secret Sauce (Configuration)
# ========================================
CONFIG = {
    "model": "runwayml/stable-diffusion-v1-5",
    "output_dir": "./lora_weights",
    "instance_data_dir": "./training_images",
    "instance_prompt": "a photo in nigerian_adire_style",
    "resolution": 512,
    "train_batch_size": 1,
    "gradient_accumulation_steps": 4,   # “save up” steps to act like a bigger batch
    "learning_rate": 1e-4,
    "lr_scheduler": "constant",
    "max_train_steps": 800,             # 800 iterations is usually the sweet spot
    "lora_rank": 4,
    "lora_alpha": 4,
    "seed": 42
}

Launch the training

# ========================================
# Cell 4: Ignition!
# ========================================
!accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path="{CONFIG['model']}" \
  --instance_data_dir="{CONFIG['instance_data_dir']}" \
  --output_dir="{CONFIG['output_dir']}" \
  --instance_prompt="{CONFIG['instance_prompt']}" \
  --resolution={CONFIG['resolution']} \
  --train_batch_size={CONFIG['train_batch_size']} \
  --gradient_accumulation_steps={CONFIG['gradient_accumulation_steps']} \
  --learning_rate={CONFIG['learning_rate']} \
  --lr_scheduler="{CONFIG['lr_scheduler']}" \
  --max_train_steps={CONFIG['max_train_steps']} \
  --rank={CONFIG['lora_rank']} \
  --use_8bit_adam \
  --checkpointing_steps=100 \
  --validation_prompt="{CONFIG['instance_prompt']} sunset over Lagos" \
  --seed={CONFIG['seed']}
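
When the run finishes, the trained adapter lands in ./lora_weights (recent diffusers releases save it as pytorch_lora_weights.safetensors), with intermediate checkpoints written every 100 steps, so a crashed Colab session costs you at most 100 steps of progress.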

Tuning the Engine: Hyperparameter Analysis

You might wonder why I chose those specific numbers in CONFIG. AI training is a bit like cooking—a pinch too much salt ruins the soup.

  • Learning Rate (1e-4) – Too high → the model “panics” and learns nothing. Too low → training drags on for days.
  • Effective Batch Size (1 × 4) – We train on one image at a time but accumulate gradients over four steps, keeping training stable without blowing GPU memory (see the sketch below).
  • LoRA Rank (4) – Lean and fast. A rank of 16 would make the file ~4× larger with negligible quality gain. Efficiency is the goal.
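
If gradient accumulation is unfamiliar, the trick is easy to see in a toy PyTorch loop (a self-contained sketch, not the actual DreamBooth script):

import torch
from torch import nn

# Toy model and data, just to show the mechanics
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
batches = [torch.randn(1, 10) for _ in range(8)]  # batch size 1, as in CONFIG

accum_steps = 4  # CONFIG["gradient_accumulation_steps"]
optimizer.zero_grad()
for i, x in enumerate(batches):
    loss = model(x).pow(2).mean() / accum_steps  # scale so gradients average
    loss.backward()                              # grads pile up in param.grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()       # one "big" update per 4 tiny batches
        optimizer.zero_grad()  # start the next effective batch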

The Factory: Building the MLOps Pipeline

Now we step away from the notebook and build a real software system. In production you don’t want to manually copy‑paste files, so we use ZenML to create a conveyor belt.
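
If you want to follow along, the orchestration tooling has to be in place first (a minimal sketch; version pins are up to you):

pip install zenml mlflow
zenml init   # turn the current directory into a ZenML repository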

Our pipeline has three main “employees”:

  • Evaluator – Checks whether the model actually creates Adire patterns or just noise.
  • Promoter – The “manager” that looks at test scores and decides if the model is good enough for customers.
  • Deployer – Packs the model up and ships it to the cloud.

Step 1: The Evaluator (Quality Control)

This step loads the newly trained model, generates a few pictures, measures latency, and evaluates how well the images match the prompts. All stats are logged to MLflow for permanent record‑keeping.

import time
from typing import Dict, List

import mlflow
import torch
from diffusers import StableDiffusionPipeline
from zenml import step

@step(enable_cache=False)
def evaluate_model(model_path: str, test_prompts: List[str]) -> Dict[str, float]:
    """
    Load the Stable Diffusion model + LoRA weights, generate images,
    time the generation, and compute a quality score using CLIP.
    """
    # Load the base model, attach the LoRA adapters, and move to the GPU
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
        safety_checker=None,
    )
    pipe.unet.load_attn_procs(model_path)
    pipe.to("cuda")

    results = {}
    for i, prompt in enumerate(test_prompts):
        # Measure generation time
        start = time.time()
        image = pipe(prompt).images[0]
        gen_time = time.time() - start

        # Compute a CLIP-based similarity score (higher = better)
        quality = compute_clip_score(image, prompt)

        results[f"prompt_{i}_time"] = gen_time
        results[f"prompt_{i}_quality"] = quality

    # Log to MLflow for permanent record-keeping
    mlflow.log_metrics(results)
    return results
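
The one helper the step relies on, compute_clip_score, isn't shown above. Here is a minimal sketch built on the openai/clip-vit-base-patch32 checkpoint from transformers; the normalization and scoring choices are mine, not the author's:

import torch
from transformers import CLIPModel, CLIPProcessor

_clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def compute_clip_score(image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = _processor(text=[prompt], images=image,
                        return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = _clip(**inputs)
    # Normalize both embeddings, then take their dot product
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return float((image_emb * text_emb).sum())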

Step 2: The Promoter (Gatekeeper)

This is our automated quality gate, and the rules are strict: if the average quality score falls below 0.75, or if generating a picture takes longer than 30 seconds, the model is “fired.” Only if it passes both checks does it get promoted to production.

@step
def promote_model(
    metrics: Dict[str, float],
    quality_threshold: float = 0.75,
    speed_threshold: float = 30.0,
) -> bool:
    """Return True only if the model clears every quality gate."""
    qualities = [v for k, v in metrics.items() if "quality" in k]
    times = [v for k, v in metrics.items() if "time" in k]
    avg_quality = sum(qualities) / len(qualities)
    avg_time = sum(times) / len(times)

    mlflow.log_metrics({"avg_quality": avg_quality, "avg_time": avg_time})

    # Does it meet our standards?
    checks = {
        "quality_check": avg_quality >= quality_threshold,
        "speed_check": avg_time <= speed_threshold,
    }
    return all(checks.values())
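
To make the gate concrete, here is a hypothetical scorecard for a model that draws beautifully but too slowly (numbers invented for illustration):

sample_metrics = {
    "prompt_0_quality": 0.82, "prompt_0_time": 41.0,
    "prompt_1_quality": 0.80, "prompt_1_time": 44.0,
}
# avg_quality = 0.81 >= 0.75  -> quality_check passes
# avg_time    = 42.5 >  30.0  -> speed_check fails
# all(checks.values()) -> False: the model is "fired"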

Step 3: The Deployer (Shipping)

import logging

from huggingface_hub import create_repo, upload_folder
from zenml import step

logger = logging.getLogger(__name__)

@step
def deploy_model(model_path: str, approved: bool) -> None:
    """
    If `approved` is True, push the model to a model registry
    (e.g., Hugging Face Hub, S3, or a custom endpoint).
    """
    if not approved:
        logger.info("Model did not meet quality thresholds – not deploying.")
        return

    # Example: push to Hugging Face Hub
    repo_id = "my-org/adire-lora"
    create_repo(repo_id, exist_ok=True)
    upload_folder(
        repo_id=repo_id,
        folder_path=model_path,
        commit_message="Deploy new Adire LoRA weights",
    )
    logger.info(f"Model deployed to https://huggingface.co/{repo_id}")
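
One practical caveat: pushing to the Hub only works when you are authenticated, e.g. via huggingface_hub.login() in a notebook or an HF_TOKEN environment variable in CI.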

Full ZenML Pipeline

from zenml import pipeline

@pipeline
def adire_training_pipeline(
    model_path: str,
    test_prompts: List[str],
    quality_threshold: float = 0.75,
    speed_threshold: float = 30.0,
):
    # Steps are wired by calling them; ZenML tracks the data flowing between them.
    metrics = evaluate_model(model_path=model_path, test_prompts=test_prompts)
    approved = promote_model(
        metrics=metrics,
        quality_threshold=quality_threshold,
        speed_threshold=speed_threshold,
    )
    deploy_model(model_path=model_path, approved=approved)

End to end, the workflow will:

  1. Train the LoRA on Colab (the notebook cells above).
  2. Evaluate the resulting model on a held‑out prompt set.
  3. Promote it only if it clears both the quality and speed gates.
  4. Deploy the approved model to a remote registry for downstream consumption.
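
Kicking the whole thing off is a single call (paths and prompts here are illustrative):

if __name__ == "__main__":
    adire_training_pipeline(
        model_path="./lora_weights",
        test_prompts=[
            "a photo in nigerian_adire_style, flowing fabric pattern",
            "a photo in nigerian_adire_style sunset over Lagos",
        ],
    )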

TL;DR

  • Colab → free GPU for iterative LoRA training.
  • CONFIG → carefully chosen hyper‑parameters for speed & quality.
  • ZenML pipeline → automated evaluation, gating, and deployment.

With this setup you have a repeatable, production‑ready “factory” that turns raw Adire images into a continuously improving generative model. 🚀

Akan
