Prompt Unit Tests: 3 Bash Scripts That Catch Regressions Before Deploy

Published: March 31, 2026, 19:36 GMT+8
4 min read
Source: Dev.to

Script 1: Golden Output Test

This script sends a fixed input to your prompt and compares the output against a known-good response.

#!/bin/bash
# test-golden.sh — Compare prompt output against golden file

PROMPT_FILE="$1"
INPUT_FILE="$2"
GOLDEN_FILE="$3"

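# Build the request; jq -Rs turns each file into a safely escaped JSON string.
# gpt-4o-mini is assumed here; use the model you deploy.
PAYLOAD="{
  \"model\": \"gpt-4o-mini\",
  \"temperature\": 0,
  \"messages\": [
    {\"role\": \"system\", \"content\": $(jq -Rs . < "$PROMPT_FILE")},
    {\"role\": \"user\", \"content\": $(jq -Rs . < "$INPUT_FILE")}
  ]
}"

# Call the API and keep only the assistant's message text.
ACTUAL=$(curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" | jq -r '.choices[0].message.content')

echo "$ACTUAL" > /tmp/prompt-test-actual.txt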

if diff -q "$GOLDEN_FILE" /tmp/prompt-test-actual.txt > /dev/null 2>&1; then
  echo "✅ PASS: Output matches golden file"
else
  echo "❌ FAIL: Output diverged"
  diff --color "$GOLDEN_FILE" /tmp/prompt-test-actual.txt
  exit 1
fi

Usage

./test-golden.sh prompts/summarize.txt fixtures/input-1.txt fixtures/expected-1.txt

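When to use

  • Run it after every prompt edit.
  • Set temperature: 0 so the output is as deterministic as the API allows.
  • When you want the output to change, update the golden file by hand.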
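
One way to make the manual update step explicit is a switch that overwrites the golden file only when you ask for it. The sketch below is not part of the original script: `compare_golden` and the `UPDATE_GOLDEN` variable are hypothetical names, and it assumes the model output has already been written to a file.

```shell
#!/bin/bash
# compare_golden is a hypothetical helper, not part of test-golden.sh.
# Usage: compare_golden ACTUAL_FILE GOLDEN_FILE
# With UPDATE_GOLDEN=1 the golden file is overwritten instead of compared.
compare_golden() {
  local actual="$1" golden="$2"

  if [ "${UPDATE_GOLDEN:-0}" = "1" ]; then
    # Intentional change: accept the new output as the reference.
    cp "$actual" "$golden"
    echo "Updated golden file: $golden"
    return 0
  fi

  if diff -q "$golden" "$actual" > /dev/null 2>&1; then
    echo "PASS: Output matches golden file"
  else
    echo "FAIL: Output diverged"
    return 1
  fi
}
```

Keeping the update behind an environment variable means a normal test run can never rewrite the reference by accident; the change to the golden file then shows up in review like any other diff.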

Script 2: Keyword Gate

Sometimes you don't need an exact match; you only need the output to contain (or not contain) specific words.
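
The matching logic itself can be tried offline before any API call is involved. The following is a minimal sketch, assuming literal case-sensitive substring matching; `check_keywords` is a hypothetical helper name and the sample text is made up for illustration.

```shell
#!/bin/bash
# check_keywords is a hypothetical helper for trying the gate without an API call.
# Usage: check_keywords TEXT "required,words" "forbidden,words"
check_keywords() {
  local text="$1" required="$2" forbidden="$3" failed=0 kw

  # Each required keyword must appear as a literal substring of the text.
  while IFS= read -r kw; do
    [ -n "$kw" ] || continue
    if ! printf '%s\n' "$text" | grep -qF -- "$kw"; then
      echo "FAIL: missing required keyword: $kw"
      failed=1
    fi
  done <<EOF
$(printf '%s' "$required" | tr ',' '\n')
EOF

  # No forbidden keyword may appear.
  while IFS= read -r kw; do
    [ -n "$kw" ] || continue
    if printf '%s\n' "$text" | grep -qF -- "$kw"; then
      echo "FAIL: found forbidden keyword: $kw"
      failed=1
    fi
  done <<EOF
$(printf '%s' "$forbidden" | tr ',' '\n')
EOF

  return "$failed"
}

# Example run against a fixed string instead of a live model response.
check_keywords "Check the security and performance impact." \
  "security,performance" "LGTM" && echo "PASS: keyword gate"
```

Reporting every missing or forbidden keyword before returning, instead of stopping at the first one, tends to save a rerun when several checks break at once.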

#!/bin/bash
# test-keywords.sh — Assert required/forbidden keywords in output

PROMPT_FILE="$1"
INPUT_FILE="$2"
REQUIRED="$3"   # comma-separated: "function,return,async"
FORBIDDEN="$4"  # comma-separated: "TODO,FIXME,undefined"

ACTUAL=$(curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"gpt-4o-mini\",
    \"messages\": [
      {\"role\": \"system\", \"content\": $(jq -Rs . < "$PROMPT_FILE")},
      {\"role\": \"user\", \"content\": $(jq -Rs . < "$INPUT_FILE")}
    ]
  }" | jq -r '.choices[0].message.content')

FAILED=0
IFS=',' read -ra REQ <<< "$REQUIRED"
IFS=',' read -ra FORB <<< "$FORBIDDEN"

# Case-insensitive fixed-string matching is an assumption; tighten it if case matters.
for kw in "${REQ[@]}"; do
  grep -qiF -- "$kw" <<< "$ACTUAL" || { echo "❌ FAIL: Missing required keyword: $kw"; FAILED=1; }
done
for kw in "${FORB[@]}"; do
  grep -qiF -- "$kw" <<< "$ACTUAL" && { echo "❌ FAIL: Found forbidden keyword: $kw"; FAILED=1; }
done

[ "$FAILED" -eq 0 ] || exit 1
echo "✅ PASS: All keyword checks passed"

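Script 3: Format Check

This script verifies that the output conforms to a specific format (JSON, YAML, Markdown, and so on) and can also check properties such as headers and line count.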

#!/bin/bash
# test-format.sh — Verify output format and optional constraints

PROMPT_FILE="$1"
INPUT_FILE="$2"
FORMAT="$3"   # json|yaml|markdown|has-headers|max-lines:NN

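# Send the prompt and input as chat messages, then keep only the reply text,
# so the checks below run on the model's output and not on the API envelope.
ACTUAL=$(curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"gpt-4o-mini\",
    \"messages\": [
      {\"role\": \"system\", \"content\": $(jq -Rs . < "$PROMPT_FILE")},
      {\"role\": \"user\", \"content\": $(jq -Rs . < "$INPUT_FILE")}
    ]
  }" | jq -r '.choices[0].message.content')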

case "$FORMAT" in
  json)
    if echo "$ACTUAL" | jq . > /dev/null 2>&1; then
      echo "✅ PASS: Valid JSON"
    else
      echo "❌ FAIL: Invalid JSON"
      echo "$ACTUAL"
      exit 1
    fi
    ;;
  yaml)
    # Needs Python 3 with PyYAML installed (pip install pyyaml).
    if echo "$ACTUAL" | python3 -c "import sys, yaml; yaml.safe_load(sys.stdin)" > /dev/null 2>&1; then
      echo "✅ PASS: Valid YAML"
    else
      echo "❌ FAIL: Invalid YAML"
      echo "$ACTUAL"
      exit 1
    fi
    ;;
  markdown|has-headers)
    if echo "$ACTUAL" | grep -q "^#"; then
      echo "✅ PASS: Contains markdown headers"
    else
      echo "❌ FAIL: No markdown headers found"
      exit 1
    fi
    ;;
  max-lines:*)
    MAX="${FORMAT#max-lines:}"
    LINES=$(echo "$ACTUAL" | wc -l)
    if [ "$LINES" -le "$MAX" ]; then
      echo "✅ PASS: $LINES lines (max: $MAX)"
    else
      echo "❌ FAIL: $LINES lines exceeds max $MAX"
      exit 1
    fi
    ;;
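  *)
    # An unrecognized format should fail loudly instead of passing silently.
    echo "❌ FAIL: Unknown format: $FORMAT"
    exit 1
    ;;
esac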
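
The header and line-count checks do not depend on the API, so they can be tried on a fixed string first. This is a sketch only: `check_format` is a hypothetical function name, and it covers just the two checks that need no extra tools such as jq or PyYAML.

```shell
#!/bin/bash
# check_format is a hypothetical helper, not part of test-format.sh.
# Usage: check_format TEXT FORMAT   where FORMAT is has-headers or max-lines:NN
check_format() {
  local text="$1" format="$2" max lines
  case "$format" in
    has-headers)
      # At least one line must start with '#'.
      printf '%s\n' "$text" | grep -q '^#'
      ;;
    max-lines:*)
      max="${format#max-lines:}"
      # tr strips the padding some wc implementations add.
      lines=$(printf '%s\n' "$text" | wc -l | tr -d ' ')
      [ "$lines" -le "$max" ]
      ;;
    *)
      echo "Unknown format: $format" >&2
      return 2
      ;;
  esac
}

# Example: a two-line markdown snippet passes both checks.
sample="# Summary
One line of body text."
check_format "$sample" has-headers && check_format "$sample" max-lines:5 \
  && echo "PASS: format checks"
```

Returning a distinct status for an unknown format keeps a typo in the format name from being mistaken for a passing test.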

Putting It All Together

I run all three scripts in one go from a Makefile:

test-prompts:
	./test-golden.sh prompts/summarize.txt fixtures/doc-1.txt fixtures/expected-summary-1.txt
	./test-keywords.sh prompts/review.txt fixtures/pr-1.txt "security,performance" "LGTM"
	./test-format.sh prompts/extract.txt fixtures/email-1.txt json
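
Because make stops a recipe at the first failing line, one failing test hides the results of the ones after it. If you would rather see every failure in a single run, a small wrapper can keep going and report a count. This is a sketch under that assumption; `run_all` is a hypothetical name and not part of the original setup.

```shell
#!/bin/bash
# run_all is a hypothetical wrapper: it runs every command string it is given,
# keeps going after a failure, and returns non-zero if any command failed.
run_all() {
  local failures=0 cmd
  for cmd in "$@"; do
    if ! sh -c "$cmd"; then
      echo "FAILED: $cmd"
      failures=$((failures + 1))
    fi
  done
  echo "$failures test(s) failed"
  [ "$failures" -eq 0 ]
}

# Possible use with the scripts from this post:
#   run_all \
#     "./test-golden.sh prompts/summarize.txt fixtures/doc-1.txt fixtures/expected-summary-1.txt" \
#     "./test-format.sh prompts/extract.txt fixtures/email-1.txt json"
```

The wrapper's exit status still fails the build when anything breaks, so it can replace the three recipe lines without weakening the CI check.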

Hook it into CI:

# .github/workflows/prompt-tests.yml
on:
  push:
    paths: ['prompts/**']
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make test-prompts
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
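
If the secret is missing or misnamed, the API returns an error body instead of a completion, and every test fails for the wrong reason. A small guard at the top of each script makes that case obvious. `require_api_key` is a hypothetical helper name, not something the original scripts define.

```shell
#!/bin/bash
# require_api_key is a hypothetical guard; call it before any curl request.
require_api_key() {
  if [ -z "${OPENAI_API_KEY:-}" ]; then
    echo "OPENAI_API_KEY is not set" >&2
    return 1
  fi
}

# Typical use at the top of a test script:
#   require_api_key || exit 1
```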

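Now every prompt change is tested automatically. The whole setup took about 20 minutes, and it has caught seven regressions since I started using it.

Your prompts are code. Test them like code.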
