Binary Weighted Evaluation...How

Published: 3 days ago (December 7, 2025, 15:44 GMT+8)

Source: Dev.to

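1. What is binary weighted evaluation?

At a high level:

Define a set of binary criteria for a task. Each criterion is a question that can be answered True or False.

Example criteria

correct_participants   # Did the agent book the right people?
clear_explanation      # Did the agent explain the outcome clearly?

Assign each criterion a weight that reflects its importance. The weights typically sum to 1.0.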

COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,
    "correct_time": 0.25,
    "correct_duration": 0.10,
    "explored_alternatives": 0.20,
    "clear_explanation": 0.20,
}

For each task, compute a score between 0.0 and 1.0 by summing the weights of all criteria that are True.

score = sum(
    COMPLETION_WEIGHTS[k]
    for k, v in checks.items()
    if v
)
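
To make the arithmetic concrete, here is a small worked example. The weights are the ones listed above; the check results are hypothetical values invented for illustration, and `weighted_score` is just a wrapper name chosen here, not something the article defines.

```python
from typing import Dict

# Weights copied from the article.
COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,
    "correct_time": 0.25,
    "correct_duration": 0.10,
    "explored_alternatives": 0.20,
    "clear_explanation": 0.20,
}


def weighted_score(checks: Dict[str, bool]) -> float:
    """Sum the weights of every criterion that passed."""
    return sum(COMPLETION_WEIGHTS[k] for k, v in checks.items() if v)


# Hypothetical results for one task: everything passed except the duration.
example_checks = {
    "correct_participants": True,
    "correct_time": True,
    "correct_duration": False,
    "explored_alternatives": True,
    "clear_explanation": True,
}

example_score = weighted_score(example_checks)
# 0.25 + 0.25 + 0.20 + 0.20; rounded to hide floating-point noise.
print(round(example_score, 2))  # prints 0.9
```

Only the failed criterion's weight (0.10) is missing from the total, so a low-weight miss costs little while a miss on participants or time would cost 0.25.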

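Classify the outcome based on the score:

Score range	Classification
`score >= 0.75` and booking confirmed	Successful completion
`score >= 0.50`	Graceful failure
(other ranges omitted in the original)	(not specified)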

# Expected structure of the checks dictionary (schema sketch: each value is the bool returned by one check)
checks: Dict[str, bool] = {
    "correct_participants": ... -> bool,
    "correct_time": ... -> bool,
    "correct_duration": ... -> bool,
    "explored_alternatives": ... -> bool,
    "clear_explanation": ... -> bool,
}

Example implementation

def _check_participants(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    booked = set(scheduling_ctx["booked_event"]["participants"])
    expected = set(ground_truth["participants"])
    return booked == expected

def _check_time(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    return scheduling_ctx["booked_event"]["time"] == ground_truth["time"]

def _check_duration(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    expected = ground_truth.get("duration", 30)
    return scheduling_ctx["booked_event"]["duration"] == expected

def _check_alternatives(scheduling_ctx, conversation_trace) -> bool:
    if not scheduling_ctx.get("conflicts"):
        # No conflicts → passes automatically
        return True
    proposed = scheduling_ctx.get("proposed_alternatives", [])
    return len(proposed) > 0

def _check_explanation(conversation_trace, conversation_stage: str) -> bool:
    if not conversation_trace:
        return False
    last_response = conversation_trace[-1].get("response", "")
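    # A silent crash is bad: a failed run must still explain itself.
    # NOTE (assumption): the comparison on this line is truncated in the source.
    # Only the threshold 20 survives, so it is reused for the failed-stage check.
    if conversation_stage == "failed" and len(last_response) < 20:
        return False
    return len(last_response) > 20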

Key rule: every check should be obviously True or False when you look at the trace.
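
One way to keep checks honest against this rule is to run them on tiny hand-written fixtures where the right answer is obvious. The sketch below reuses `_check_participants` from above; the fixture names and values (`alice`, `bob`, and so on) are hypothetical and only there for illustration.

```python
def _check_participants(scheduling_ctx, ground_truth) -> bool:
    # Same logic as in the article: an unconfirmed booking can never pass.
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    booked = set(scheduling_ctx["booked_event"]["participants"])
    expected = set(ground_truth["participants"])
    return booked == expected


# Hypothetical ground truth and three hypothetical final states.
ground_truth = {"participants": ["alice", "bob"]}

confirmed_ok = {
    "booking_confirmed": True,
    "booked_event": {"participants": ["bob", "alice"]},  # order does not matter
}
confirmed_wrong = {
    "booking_confirmed": True,
    "booked_event": {"participants": ["alice"]},  # bob is missing
}
unconfirmed = {"booking_confirmed": False}

results = {
    "confirmed_ok": _check_participants(confirmed_ok, ground_truth),
    "confirmed_wrong": _check_participants(confirmed_wrong, ground_truth),
    "unconfirmed": _check_participants(unconfirmed, ground_truth),
}
print(results)
```

If a fixture's expected result is not obvious to a person reading it, that is usually a sign the criterion is not really binary yet.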

3. Step 2 – Turn business priorities into weights

Not all criteria matter equally. In the scheduling example:

COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,   # wrong people or wrong time is catastrophic
    "correct_time": 0.25,
    "correct_duration": 0.10,       # a slightly wrong duration is annoying
    "explored_alternatives": 0.20,   # builds user trust
    "clear_explanation": 0.20,       # builds user trust
}

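Guidelines for designing weights

Start from business impact, not from how easy a criterion is to check.
Make the weights sum to 1.0 so the score is easier to interpret.
Keep the number of criteria moderate (4–7).
Be willing to adjust the weights after observing real data.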
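
Since weights tend to get edited over time, a small sanity check can catch a table that no longer sums to 1.0. This is a minimal sketch, not part of the original article: `validate_weights` and its `tolerance` parameter are hypothetical names introduced here.

```python
import math
from typing import Dict

# Weights copied from the article.
COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,
    "correct_time": 0.25,
    "correct_duration": 0.10,
    "explored_alternatives": 0.20,
    "clear_explanation": 0.20,
}


def validate_weights(weights: Dict[str, float], tolerance: float = 1e-9) -> None:
    """Raise ValueError if any weight is negative or the total is not 1.0."""
    for name, weight in weights.items():
        if weight < 0:
            raise ValueError(f"weight for {name!r} is negative: {weight}")
    # fsum avoids accumulating floating-point error across the additions.
    total = math.fsum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=tolerance):
        raise ValueError(f"weights sum to {total}, expected 1.0")


# Passes silently for the article's weights.
validate_weights(COMPLETION_WEIGHTS)
```

Calling this once at import time or in a unit test is usually enough; the comparison uses a tolerance because decimal fractions such as 0.1 are not exactly representable as floats.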

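4. Step 3 – Implement the per-request evaluator

Combine the boolean checks with the weights to compute a score for a single request.
A convenient representation is an EvaluationResult dataclass: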

from dataclasses import dataclass
from enum import Enum
from typing import Dict

class OutcomeType(Enum):
    SUCCESSFUL_COMPLETION = "successful_completion"
    GRACEFUL_FAILURE = "graceful_failure"
    PARTIAL_FAILURE = "partial_failure"
    HARD_FAILURE = "hard_failure"

@dataclass
class EvaluationResult:
    score: float                 # 0.0 to 1.0
    details: Dict[str, bool]     # criterion -> passed?
    outcome_type: OutcomeType
    explanation: str

Core evaluation function

def evaluate_task_completion(final_state, ground_truth, conversation_trace) -> EvaluationResult:
    scheduling_ctx = final_state.get("scheduling_context", {})
    conversation_stage = final_state.get("conversation_stage", "unknown")

    checks = {
        "correct_participants": _check_participants(scheduling_ctx, ground_truth),
        "correct_time": _check_time(scheduling_ctx, ground_truth),
        "correct_duration": _check_duration(scheduling_ctx, ground_truth),
        "explored_alternatives": _check_alternatives(scheduling_ctx, conversation_trace),
        "clear_explanation": _check_explanation(conversation_trace, conversation_stage),
    }

    score = sum(
        COMPLETION_WEIGHTS[k]
        for k, v in checks.items()
        if v
    )

    outcome = _classify_outcome(scheduling_ctx, conversation_stage, score)
    explanation = _generate_explanation(checks, outcome, score)

    return EvaluationResult(
        score=score,
        details=checks,
        outcome_type=outcome,
        explanation=explanation,
    )

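The function returns:

A numeric score for analysis and thresholding.
A details dictionary for debugging.
A human-friendly explanation for reports or console output.

5. Step 4 – Map scores to outcome categories

(The implementations of _classify_outcome and _generate_explanation are domain-specific, but they typically follow the score thresholds described in Section 1.)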
