Building Production AI: A Three-Stage MLOps Journey - Part 2
Source: …
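The Training Lab: Google Colab Setup
First things first: we need a working environment. Training AI is like asking a computer to run a marathon: it is very resource-intensive. We use Google Colab because it provides a free T4 GPU, which is the "engine" we need to train our Adire model.
Installing the Required Libraries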
# ========================================
# Cell 1: Environment Setup
# ========================================
!pip install -q diffusers==0.25.0 transformers==4.36.0 \
accelerate==0.25.0 peft==0.7.1 bitsandbytes
# Verify that the GPU is available
import torch
assert torch.cuda.is_available(), "No GPU found! Check your Colab settings."
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB")
Downloading the Training Script
# ========================================
# Cell 2: Download the 'Recipe'
# ========================================
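# Grab a proven script from the Hugging Face team.
# The tag should match the installed diffusers version (0.25.0 in Cell 1);
# the example scripts check for a minimum diffusers version at startup.
!wget -q https://raw.githubusercontent.com/huggingface/diffusers/v0.25.0/examples/dreambooth/train_dreambooth_lora.py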
Configuring the Training Run
# ========================================
# Cell 3: The Secret Sauce (Configuration)
# ========================================
CONFIG = {
    "model": "runwayml/stable-diffusion-v1-5",
    "output_dir": "./lora_weights",
    "instance_data_dir": "./training_images",
    "instance_prompt": "a photo in nigerian_adire_style",
    "resolution": 512,
    "train_batch_size": 1,
    "gradient_accumulation_steps": 4,  # "save up" steps to act like a bigger batch
    "learning_rate": 1e-4,
    "lr_scheduler": "constant",
    "max_train_steps": 800,  # 800 iterations is usually the sweet spot
    # Note: lora_rank and lora_alpha are recorded here but are not
    # forwarded by the launch command in Cell 4.
    "lora_rank": 4,
    "lora_alpha": 4,
    "seed": 42,
}
Launching Training
# ========================================
# Cell 4: Ignition!
# ========================================
!accelerate launch train_dreambooth_lora.py \
--pretrained_model_name_or_path="{CONFIG['model']}" \
--instance_data_dir="{CONFIG['instance_data_dir']}" \
--output_dir="{CONFIG['output_dir']}" \
--instance_prompt="{CONFIG['instance_prompt']}" \
--resolution={CONFIG['resolution']} \
--train_batch_size={CONFIG['train_batch_size']} \
--gradient_accumulation_steps={CONFIG['gradient_accumulation_steps']} \
--learning_rate={CONFIG['learning_rate']} \
--lr_scheduler="{CONFIG['lr_scheduler']}" \
--max_train_steps={CONFIG['max_train_steps']} \
--use_8bit_adam \
--checkpointing_steps=100 \
--validation_prompt="{CONFIG['instance_prompt']} sunset over Lagos" \
--seed={CONFIG['seed']}
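The launch cell relies on IPython interpolating `{CONFIG[...]}` into a shell line. As a rough alternative, the same command can be assembled as an argument list in plain Python, which makes it easy to print and inspect before starting a long run. This is a minimal sketch: `build_launch_command` is a hypothetical helper, not part of the original notebook, and it uses only the flags that appear in the command above.

```python
import shlex
from typing import List

# Copy of the article's CONFIG, limited to the keys the launch command reads.
CONFIG = {
    "model": "runwayml/stable-diffusion-v1-5",
    "output_dir": "./lora_weights",
    "instance_data_dir": "./training_images",
    "instance_prompt": "a photo in nigerian_adire_style",
    "resolution": 512,
    "train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "learning_rate": 1e-4,
    "lr_scheduler": "constant",
    "max_train_steps": 800,
    "seed": 42,
}


def build_launch_command(config: dict) -> List[str]:
    """Assemble the accelerate command as an argument list (hypothetical helper)."""
    prompt = config["instance_prompt"]
    return [
        "accelerate", "launch", "train_dreambooth_lora.py",
        f"--pretrained_model_name_or_path={config['model']}",
        f"--instance_data_dir={config['instance_data_dir']}",
        f"--output_dir={config['output_dir']}",
        f"--instance_prompt={prompt}",
        f"--resolution={config['resolution']}",
        f"--train_batch_size={config['train_batch_size']}",
        f"--gradient_accumulation_steps={config['gradient_accumulation_steps']}",
        f"--learning_rate={config['learning_rate']}",
        f"--lr_scheduler={config['lr_scheduler']}",
        f"--max_train_steps={config['max_train_steps']}",
        "--use_8bit_adam",
        "--checkpointing_steps=100",
        f"--validation_prompt={prompt} sunset over Lagos",
        f"--seed={config['seed']}",
    ]


command = build_launch_command(CONFIG)
# shlex.join quotes arguments that contain spaces, so the line is shell-safe.
print(shlex.join(command))
```

Passing `command` to `subprocess.run(command, check=True)` would start the run; printing it first is a cheap way to catch a typo in a path or prompt.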
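Tuning the Engine: Hyperparameter Analysis
You may wonder why I chose these particular values in CONFIG. AI training is a bit like cooking: a little too much salt ruins the soup.
| Hyperparameter | Explanation |
|---|---|
| Learning rate (1e-4) | Too high → the model "panics" and learns nothing. Too low → training drags on for days. |
| Effective batch size (1 × 4) | We train on one image at a time but accumulate gradients over four steps, which keeps training stable without exhausting GPU memory. |
| LoRA rank (4) | Lightweight and fast. At rank 16 the file would be roughly 4× larger, with only a marginal gain in quality. The goal is efficiency. |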
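The arithmetic behind two of these rows can be checked in a few lines. The sketch below assumes the standard LoRA formulation, in which a weight matrix of shape `d_out × d_in` gets two low-rank factors with `rank × (d_in + d_out)` extra parameters in total. The 768-wide layer is an illustrative size chosen for the example, not a figure from the article.

```python
def effective_batch_size(train_batch_size: int, gradient_accumulation_steps: int) -> int:
    """Number of samples that contribute to each optimizer update."""
    return train_batch_size * gradient_accumulation_steps


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Extra parameters LoRA adds to one d_out x d_in weight matrix.

    The update is B @ A with A of shape (rank, d_in) and B of shape (d_out, rank).
    """
    return rank * (d_in + d_out)


batch = effective_batch_size(train_batch_size=1, gradient_accumulation_steps=4)

# Illustrative layer size (an assumption for this example).
rank_4 = lora_param_count(768, 768, rank=4)
rank_16 = lora_param_count(768, 768, rank=16)

print(f"effective batch size: {batch}")
print(f"rank 4 params per layer:  {rank_4}")
print(f"rank 16 params per layer: {rank_16}")
print(f"size ratio: {rank_16 / rank_4:.0f}x")
```

Because the parameter count is linear in the rank, the 4× figure holds for any layer shape, which is why the saved weights file grows by about the same factor.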
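The Factory: Building the MLOps Pipeline
Now we leave the notebook and build a real software system. In production you don't want to copy and paste files by hand, so we use ZenML to create a conveyor belt.
Our pipeline has three main "workers":
- Evaluator – checks whether the model really generates Adire patterns or just noise.
- Promoter – the "manager" that looks at the test scores and decides whether the model is good enough to ship to customers.
- Deployer – packages the model and sends it to the cloud.
Step 1: The Evaluator (Quality Control)
This step loads the freshly trained model, generates a few images, measures latency, and scores how well each image matches its prompt. All statistics are logged to MLflow for safekeeping.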
import time
from typing import Dict, List

import mlflow
import torch
from diffusers import StableDiffusionPipeline
from zenml import step


@step(enable_cache=False)
def evaluate_model(model_path: str, test_prompts: List[str]) -> Dict[str, float]:
    """
    Load the Stable Diffusion model + LoRA weights, generate images,
    time the generation, and compute a quality score using CLIP.
    """
    # Load the base model and move it to the GPU (float16 inference needs CUDA)
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
        safety_checker=None,
    ).to("cuda")
    # Attach the LoRA adapters written by the training script
    pipe.load_lora_weights(model_path)
    results = {}
    for i, prompt in enumerate(test_prompts):
        # Measure generation time
        start = time.time()
        image = pipe(prompt).images[0]
        gen_time = time.time() - start
        # Compute a CLIP-based similarity score (higher = better).
        # compute_clip_score is assumed to be defined elsewhere in the project.
        quality = compute_clip_score(image, prompt)
        results[f"prompt_{i}_time"] = gen_time
        results[f"prompt_{i}_quality"] = quality
    # Log to MLflow (example)
    mlflow.log_metrics(results)
    return results
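The article does not show `compute_clip_score`. CLIP-based scores are commonly built on the cosine similarity between an image embedding and a text embedding, so the sketch below illustrates only that similarity step. It uses plain Python and made-up vectors; in a real helper the two embeddings would come from a CLIP model, and how the raw similarity is scaled to compare against a 0.75 threshold is an open assumption.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as "no similarity".
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up stand-ins for CLIP embeddings (real ones have hundreds of dimensions).
image_embedding = [0.2, 0.9, 0.1]
text_embedding = [0.1, 0.8, 0.3]

score = cosine_similarity(image_embedding, text_embedding)
print(f"similarity: {score:.3f}")
```

Vectors that point in the same direction score 1.0 and orthogonal vectors score 0.0, so a higher value means the image and the prompt sit closer together in embedding space.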
Step 2: The Promoter (The Gatekeeper)
from typing import Dict

import mlflow
from zenml import step


@step
def promote_model(metrics: Dict[str, float], threshold: float = 0.75) -> bool:
    """
    Decide whether the model passes quality gates.
    Returns True if the average quality score is at least `threshold`.
    """
    quality_keys = [k for k in metrics if "quality" in k]
    if not quality_keys:
        # No quality scores were recorded, so there is nothing to promote
        return False
    avg_quality = sum(metrics[k] for k in quality_keys) / len(quality_keys)
    mlflow.log_metric("avg_quality", avg_quality)
    return avg_quality >= threshold
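The evaluator returns one timing and one quality entry per prompt (`prompt_{i}_time`, `prompt_{i}_quality`), while a gate usually wants a single number for each. The following is a small sketch of collapsing the per-prompt dictionary into two averages. `summarize_metrics` is a hypothetical helper and the sample values are invented for illustration.

```python
from statistics import mean
from typing import Dict


def summarize_metrics(results: Dict[str, float]) -> Dict[str, float]:
    """Collapse per-prompt entries into one average time and one average quality."""
    times = [v for k, v in results.items() if k.endswith("_time")]
    qualities = [v for k, v in results.items() if k.endswith("_quality")]
    if not times or not qualities:
        raise ValueError("results must contain at least one time and one quality entry")
    return {"avg_time": mean(times), "avg_quality": mean(qualities)}


# Invented per-prompt results in the shape the evaluator returns.
example_results = {
    "prompt_0_time": 4.0,
    "prompt_0_quality": 0.5,
    "prompt_1_time": 6.0,
    "prompt_1_quality": 1.0,
}

summary = summarize_metrics(example_results)
print(summary)
```

Matching on the `_time` and `_quality` suffixes keeps the helper independent of how many prompts were evaluated.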
Step 3: The Deployer (Shipping)
from huggingface_hub import create_repo, upload_folder
from zenml import step
from zenml.logger import get_logger

logger = get_logger(__name__)


@step
def deploy_model(model_path: str, approved: bool) -> None:
    """
    If `approved` is True, push the model to a model registry
    (e.g., Hugging Face Hub, S3, or a custom endpoint).
    """
    if not approved:
        logger.info("Model did not meet quality thresholds – not deploying.")
        return
    # Example: push to Hugging Face Hub
    repo_id = "my-org/adire-lora"
    create_repo(repo_id, exist_ok=True)
    upload_folder(
        repo_id=repo_id,
        folder_path=model_path,
        commit_message="Deploy new Adire LoRA weights",
    )
    logger.info(f"Model deployed to https://huggingface.co/{repo_id}")
The Complete ZenML Pipeline
from typing import List

from zenml import pipeline


@pipeline
def adire_training_pipeline(
    model_path: str,
    test_prompts: List[str],
    quality_threshold: float = 0.75,
):
    # Steps are called directly; ZenML wires each output to the next step
    metrics = evaluate_model(model_path=model_path, test_prompts=test_prompts)
    approved = promote_model(metrics=metrics, threshold=quality_threshold)
    deploy_model(model_path=model_path, approved=approved)
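End to end, the workflow is:
- Train the LoRA (on Colab, before the pipeline runs).
- Evaluate the resulting model on a held-out set of prompts.
- Promote it only if it meets the quality threshold.
- Deploy the approved model to a remote registry for downstream use.
TL;DR
- Colab → a free GPU for iterating on LoRA training.
- CONFIG → hyperparameters chosen for speed and quality.
- ZenML pipeline → automated evaluation, gating, and deployment.
With this setup you have a repeatable, production-ready "factory" that turns raw Adire images into a continuously improving generative model. 🚀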
    quality = compute_clip_score(image, prompt)
    metrics = {"avg_time": gen_time, "avg_quality": quality}
    mlflow.log_metrics(metrics)  # Keep a receipt!
    return metrics
Step 2: The Promoter (The Decision Maker)
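This is our automated quality gate. We set strict rules: if quality falls below 0.75, or an image takes more than 30 seconds to draw, the model is "fired". If it passes, it is promoted to "production" status.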
@step
def promote_model(metrics: Dict[str, float], thresholds: Dict[str, float]):
    # Does it meet our standards?
    checks = {
        "quality_check": metrics["avg_quality"] >= thresholds["quality"],
        "speed_check": metrics["avg_time"] <= thresholds["speed"],
    }
    return all(checks.values())
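To see the gate behave without a ZenML stack, the same checks can be run as a plain function. This is a minimal sketch: the thresholds mirror the rules stated above (quality of at least 0.75, at most 30 seconds per image), while `run_quality_gate` and the two sample metric sets are hypothetical and exist only for illustration.

```python
from typing import Dict

# Thresholds taken from the rules described above.
THRESHOLDS = {"quality": 0.75, "speed": 30.0}


def run_quality_gate(metrics: Dict[str, float], thresholds: Dict[str, float]) -> Dict[str, bool]:
    """Return the outcome of each individual check so failures are easy to diagnose."""
    return {
        "quality_check": metrics["avg_quality"] >= thresholds["quality"],
        "speed_check": metrics["avg_time"] <= thresholds["speed"],
    }


# Invented metrics: one model that passes, one that is too slow.
candidates = {
    "fast_model": {"avg_quality": 0.82, "avg_time": 12.5},
    "slow_model": {"avg_quality": 0.82, "avg_time": 41.0},
}

decisions = {}
for name, metrics in candidates.items():
    checks = run_quality_gate(metrics, THRESHOLDS)
    decisions[name] = all(checks.values())
    verdict = "promote" if decisions[name] else "reject"
    print(f"{name}: {checks} -> {verdict}")
```

Returning the per-check dictionary, rather than only the final boolean, makes it clear which rule a rejected model failed.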