Building Production AI: A Three-Stage MLOps Journey - Part 2

Published: 1 day ago (January 19, 2026, 00:57 GMT+8)

7 min read



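The Training Lab: Google Colab Setup

First things first: we need a working environment. Training AI is like asking a computer to run a marathon; it is very resource-intensive. We use Google Colab because it provides a free T4 GPU, the "engine" we need to train our Adire model.

Install the Required Libraries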

# ========================================
# Cell 1: Environment Setup
# ========================================
!pip install -q diffusers==0.25.0 transformers==4.36.0 \
             accelerate==0.25.0 peft==0.7.1 bitsandbytes

# Verify that the GPU is available
import torch
assert torch.cuda.is_available(), "No GPU found! Check your Colab settings."
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB")

Download the Training Script

# ========================================
# Cell 2: Download the 'Recipe'
# ========================================
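# Grab a proven script from the HuggingFace team.
# Note: the tag below (v0.36.0) does not match the diffusers==0.25.0 pinned in Cell 1.
# The example scripts check for a minimum diffusers version at startup,
# so align the two versions if the script refuses to run.
!wget -q https://raw.githubusercontent.com/huggingface/diffusers/v0.36.0/examples/dreambooth/train_dreambooth_lora.py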

Configure the Training Run

# ========================================
# Cell 3: The Secret Sauce (Configuration)
# ========================================
CONFIG = {
    "model": "runwayml/stable-diffusion-v1-5",
    "output_dir": "./lora_weights",
    "instance_data_dir": "./training_images",
    "instance_prompt": "a photo in nigerian_adire_style",
    "resolution": 512,
    "train_batch_size": 1,
    "gradient_accumulation_steps": 4,   # “save up” steps to act like a bigger batch
    "learning_rate": 1e-4,
    "lr_scheduler": "constant",
    "max_train_steps": 800,             # 800 iterations is usually the sweet spot
    "lora_rank": 4,
    "lora_alpha": 4,
    "seed": 42
}

Launch Training

# ========================================
# Cell 4: Ignition!
# ========================================
!accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path="{CONFIG['model']}" \
  --instance_data_dir="{CONFIG['instance_data_dir']}" \
  --output_dir="{CONFIG['output_dir']}" \
  --instance_prompt="{CONFIG['instance_prompt']}" \
  --resolution={CONFIG['resolution']} \
  --train_batch_size={CONFIG['train_batch_size']} \
  --gradient_accumulation_steps={CONFIG['gradient_accumulation_steps']} \
  --learning_rate={CONFIG['learning_rate']} \
  --lr_scheduler="{CONFIG['lr_scheduler']}" \
  --max_train_steps={CONFIG['max_train_steps']} \
  --rank={CONFIG['lora_rank']} \
  --use_8bit_adam \
  --checkpointing_steps=100 \
  --validation_prompt="{CONFIG['instance_prompt']} sunset over Lagos" \
  --seed={CONFIG['seed']}

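Tuning the Engine: Hyperparameter Analysis

You may wonder why I picked these specific values in CONFIG. Training AI is a bit like cooking: a little too much salt ruins the soup.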

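Hyperparameter	Explanation
Learning rate (`1e-4`)	Too high → the model "panics" and learns nothing. Too low → training drags on for days.
Effective batch size (`1 × 4`)	We train on one image at a time but accumulate gradients over four steps, which keeps training stable without exhausting GPU memory.
LoRA rank (`4`)	Lightweight and fast. At rank 16 the file would be roughly 4× larger for a negligible gain in quality. The goal is efficiency.
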

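The Factory: Building the MLOps Pipeline

Now we leave the notebook and build a real software system. In production you don't want to copy and paste files by hand, so we use ZenML to create a conveyor belt.

Our pipeline has three main "workers":

Evaluator – checks whether the model actually produces Adire patterns or just noise.
Promoter – the "manager" who looks at the test scores and decides whether the model is good enough to ship to customers.
Deployer – packages the model and sends it to the cloud.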

Step 1: The Evaluator (Quality Control)

This step loads the freshly trained model, generates a few images, measures latency, and scores how well each image matches its prompt. All statistics are logged to MLflow for safekeeping.

import time
from typing import Dict, List

import mlflow
import torch
from diffusers import StableDiffusionPipeline
from zenml import step


@step(enable_cache=False)
def evaluate_model(model_path: str, test_prompts: List[str]) -> Dict[str, float]:
    """
    Load the Stable Diffusion model + LoRA weights, generate images,
    time the generation, and compute a quality score using CLIP.
    """
    # Load the base model and move it to the GPU (float16 weights are meant for GPU inference)
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
        safety_checker=None,
    ).to("cuda")
    # Attach the LoRA adapters saved by the training script
    pipe.unet.load_attn_procs(model_path)

    results = {}
    for i, prompt in enumerate(test_prompts):
        # Measure generation time
        start = time.time()
        image = pipe(prompt).images[0]
        gen_time = time.time() - start

        # Compute a CLIP-based similarity score (higher = better).
        # compute_clip_score is a helper that is not shown in this article.
        quality = compute_clip_score(image, prompt)

        results[f"prompt_{i}_time"] = gen_time
        results[f"prompt_{i}_quality"] = quality

    # Log to MLflow (example)
    mlflow.log_metrics(results)
    return results
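
The evaluator calls compute_clip_score without showing it. Below is one possible sketch, assuming the score is the plain cosine similarity between CLIP's image and text embeddings. The checkpoint name `openai/clip-vit-base-patch32` and the `cosine_similarity` helper are my assumptions, not the author's implementation. Raw CLIP cosine similarities for image-caption pairs usually land well below 0.75, so the threshold used later presumably assumes some rescaling that the article does not describe.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def compute_clip_score(
    image,
    prompt: str,
    model_name: str = "openai/clip-vit-base-patch32",  # assumed checkpoint
) -> float:
    """Embed the image and the prompt with CLIP and return their cosine similarity."""
    # Imported lazily so cosine_similarity stays usable without these packages.
    import torch
    from transformers import CLIPModel, CLIPProcessor

    # Reloading the model on every call is simple but slow; cache it in real use.
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)

    inputs = processor(
        text=[prompt], images=image, return_tensors="pt", padding=True, truncation=True
    )
    with torch.no_grad():
        outputs = model(**inputs)

    image_vec = outputs.image_embeds[0].tolist()
    text_vec = outputs.text_embeds[0].tolist()
    return cosine_similarity(image_vec, text_vec)
```

The similarity math is kept separate from the model call so it can be checked on toy vectors without downloading any weights.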

Step 2: The Promoter (The Gatekeeper)

@step
def promote_model(metrics: Dict[str, float], threshold: float = 0.75) -> bool:
    """
    Decide whether the model passes quality gates.
    Returns True if the average quality score is at least `threshold`.
    """
    quality_keys = [k for k in metrics if "quality" in k]
    if not quality_keys:
        # No scores means nothing to judge: fail closed instead of dividing by zero
        return False
    avg_quality = sum(metrics[k] for k in quality_keys) / len(quality_keys)

    mlflow.log_metric("avg_quality", avg_quality)
    return avg_quality >= threshold

Step 3: The Deployer (Shipping)

import logging

from huggingface_hub import create_repo, upload_folder

logger = logging.getLogger(__name__)


@step
def deploy_model(model_path: str, approved: bool) -> None:
    """
    If `approved` is True, push the model to a model registry
    (e.g., Hugging Face Hub, S3, or a custom endpoint).
    """
    if not approved:
        logger.info("Model did not meet quality thresholds – not deploying.")
        return

    # Example: push to Hugging Face Hub
    repo_id = "my-org/adire-lora"
    create_repo(repo_id, exist_ok=True)
    upload_folder(
        repo_id=repo_id,
        folder_path=model_path,
        commit_message="Deploy new Adire LoRA weights",
    )
    logger.info(f"Model deployed to https://huggingface.co/{repo_id}")

The Complete ZenML Pipeline

from zenml import pipeline

@pipeline
def adire_training_pipeline(
    model_path: str,
    test_prompts: List[str],
    quality_threshold: float = 0.75,
):
    # Steps are called directly; ZenML wires each output to the next step's input
    metrics = evaluate_model(model_path=model_path, test_prompts=test_prompts)
    approved = promote_model(metrics=metrics, threshold=quality_threshold)
    deploy_model(model_path=model_path, approved=approved)

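End to end, the workflow will:

Train the LoRA (on Colab).
Evaluate the resulting model on a held-out set of prompts.
Promote the model only if it meets the quality threshold.
Deploy the approved model to a remote registry for downstream use.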
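
The control flow of this conveyor belt can be tried out without a GPU or ZenML. The sketch below is plain Python with stand-in steps: `run_conveyor_belt`, `fake_evaluate`, and `gate` are hypothetical names, and the 0.8 score is made up for illustration. Unlike the ZenML version, which always runs the deploy step and lets it check `approved` internally, this sketch simply skips the deploy call when the gate fails.

```python
from typing import Callable, Dict, List


def run_conveyor_belt(
    evaluate: Callable[[str, List[str]], Dict[str, float]],
    promote: Callable[[Dict[str, float]], bool],
    deploy: Callable[[str], None],
    model_path: str,
    test_prompts: List[str],
) -> bool:
    """Run evaluate -> promote -> deploy, skipping deploy when the gate fails."""
    metrics = evaluate(model_path, test_prompts)
    approved = promote(metrics)
    if approved:
        deploy(model_path)
    return approved


# Stand-in steps so the control flow can be exercised without a GPU.
deployed: List[str] = []


def fake_evaluate(model_path: str, prompts: List[str]) -> Dict[str, float]:
    # Pretend every prompt scored 0.8 (made-up value)
    return {f"prompt_{i}_quality": 0.8 for i, _ in enumerate(prompts)}


def gate(metrics: Dict[str, float], threshold: float = 0.75) -> bool:
    scores = list(metrics.values())
    return bool(scores) and sum(scores) / len(scores) >= threshold


shipped = run_conveyor_belt(
    fake_evaluate, gate, deployed.append, "./lora_weights", ["adire scarf"]
)
blocked = run_conveyor_belt(
    fake_evaluate, lambda m: gate(m, threshold=0.9), deployed.append,
    "./lora_weights", ["adire scarf"],
)
print(shipped, blocked, deployed)
```

With the stricter 0.9 threshold the model is evaluated but never shipped, which is exactly the behavior the gate is meant to guarantee.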

TL;DR

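Colab → free GPU for iterating on LoRA training.
CONFIG → hyperparameters chosen carefully for speed and quality.
ZenML pipeline → automated evaluation, gating, and deployment.

With this setup you have a repeatable, production-ready "factory" that turns raw Adire images into a continuously improving generative model. 🚀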

clip_score(image, prompt)

metrics = {"avg_time": gen_time, "avg_quality": quality}
mlflow.log_metrics(metrics)  # Keep a receipt!
return metrics

Step 2: The Promoter (The Decision Maker)

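This is our automated quality gate. We set strict rules: if quality falls below 0.75, or if drawing one image takes longer than 30 seconds, the model is "fired". If it passes, it is promoted to "production" status.

@step
def promote_model(metrics: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    # Does it meet our standards?
    checks = {
        "quality_check": metrics["avg_quality"] >= thresholds["quality"],
        "speed_check": metrics["avg_time"] <= thresholds["speed"],
    }
    # Promote only when every check passes
    return all(checks.values())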
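
To see the gate in action, here is a small sketch that applies the same two checks outside ZenML. The thresholds come from the rule above; the metric values and the `gate_report` name are hypothetical, chosen only to show one passing and one failing model.

```python
from typing import Dict

# Thresholds taken from the rule described above.
THRESHOLDS = {"quality": 0.75, "speed": 30.0}  # speed is seconds per image


def gate_report(metrics: Dict[str, float], thresholds: Dict[str, float]) -> Dict[str, bool]:
    """Return the outcome of each check so a failed gate can be explained."""
    return {
        "quality_check": metrics["avg_quality"] >= thresholds["quality"],
        "speed_check": metrics["avg_time"] <= thresholds["speed"],
    }


# Hypothetical metrics: same quality, different generation speed.
good = gate_report({"avg_quality": 0.82, "avg_time": 12.5}, THRESHOLDS)
slow = gate_report({"avg_quality": 0.82, "avg_time": 45.0}, THRESHOLDS)

print(all(good.values()), good)
print(all(slow.values()), slow)
```

Returning the per-check results rather than a single boolean makes it easy to log why a model was "fired", for example a good-looking model that is simply too slow.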
