Binary Weighted Evaluation... How To

Published: December 7, 2025, 15:44 GMT+8
Original: Dev.to

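1. What is binary weighted evaluation?

At a high level:

  • Define a set of binary criteria for a task. Each criterion is a question that can be answered True or False.

Example criteria

correct_participants   # Did the agent book the right people?
clear_explanation      # Did the agent explain the outcome clearly?
  • Assign each criterion a weight that reflects its importance. The weights usually sum to 1.0: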
COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,
    "correct_time": 0.25,
    "correct_duration": 0.10,
    "explored_alternatives": 0.20,
    "clear_explanation": 0.20,
}
  • For each task, compute a score between 0.0 and 1.0 by summing the weights of every criterion that is True:
score = sum(
    COMPLETION_WEIGHTS[k]
    for k, v in checks.items()
    if v
)
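  • Classify the result based on the score:

Score range                               Classification
score >= 0.75 and booking confirmed       Successful completion
score >= 0.50                             Graceful failure
(other ranges omitted in the original)    (not specified)
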
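from typing import Dict

# Expected structure of the checks dictionary: each criterion name maps to a bool.
# The False values are placeholders; in practice each comes from its check function.
checks: Dict[str, bool] = {
    "correct_participants": False,
    "correct_time": False,
    "correct_duration": False,
    "explored_alternatives": False,
    "clear_explanation": False,
}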
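
To make the scoring rule concrete, here is a small worked example. The weights are the ones listed above; the `weighted_score` helper name and the `example_checks` values are assumptions made up for illustration, not something the original article defines.

```python
from typing import Dict

# Weights copied from the scheduling example in this article.
COMPLETION_WEIGHTS: Dict[str, float] = {
    "correct_participants": 0.25,
    "correct_time": 0.25,
    "correct_duration": 0.10,
    "explored_alternatives": 0.20,
    "clear_explanation": 0.20,
}


def weighted_score(checks: Dict[str, bool]) -> float:
    """Sum the weights of every criterion that passed (hypothetical helper)."""
    return sum(COMPLETION_WEIGHTS[name] for name, passed in checks.items() if passed)


# Hypothetical run: everything is right except the meeting duration.
example_checks = {
    "correct_participants": True,
    "correct_time": True,
    "correct_duration": False,
    "explored_alternatives": True,
    "clear_explanation": True,
}

example_score = weighted_score(example_checks)
print(f"score = {example_score:.2f}")  # prints: score = 0.90
```

Because the only failed criterion carries a weight of 0.10, the run still lands well above the 0.75 threshold. Whether it counts as a successful completion then depends on the booking being confirmed.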

Example implementation

def _check_participants(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    booked = set(scheduling_ctx["booked_event"]["participants"])
    expected = set(ground_truth["participants"])
    return booked == expected

def _check_time(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    return scheduling_ctx["booked_event"]["time"] == ground_truth["time"]

def _check_duration(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    expected = ground_truth.get("duration", 30)
    return scheduling_ctx["booked_event"]["duration"] == expected

def _check_alternatives(scheduling_ctx, conversation_trace) -> bool:
    if not scheduling_ctx.get("conflicts"):
        # No conflicts → automatic pass
        return True
    proposed = scheduling_ctx.get("proposed_alternatives", [])
    return len(proposed) > 0

def _check_explanation(conversation_trace, conversation_stage: str) -> bool:
    if not conversation_trace:
        return False
    last_response = conversation_trace[-1].get("response", "")
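    # A silent crash is bad: a failed run must still explain itself.
    # Only the 20-character minimum is known; applying it to the failed
    # stage as well is an assumption.
    if conversation_stage == "failed" and len(last_response) <= 20:
        return False
    return len(last_response) > 20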

Key rule: each check should be obviously True or False when you look at the trace.
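
One way to keep a check "obviously True or False" is to run it against a few tiny hand-built fixtures. The sketch below copies `_check_alternatives` from the example implementation and feeds it three hypothetical scheduling contexts; the fixture names and field values are invented for illustration.

```python
def _check_alternatives(scheduling_ctx, conversation_trace) -> bool:
    # Copied from the example implementation above.
    if not scheduling_ctx.get("conflicts"):
        # No conflicts, so there was nothing to explore: automatic pass.
        return True
    proposed = scheduling_ctx.get("proposed_alternatives", [])
    return len(proposed) > 0


# Hypothetical fixtures covering the three cases the check distinguishes.
no_conflict = {"conflicts": []}
conflict_ignored = {"conflicts": ["10:00 overlaps with standup"]}
conflict_handled = {
    "conflicts": ["10:00 overlaps with standup"],
    "proposed_alternatives": ["11:00", "14:30"],
}

results = {
    "no_conflict": _check_alternatives(no_conflict, []),
    "conflict_ignored": _check_alternatives(conflict_ignored, []),
    "conflict_handled": _check_alternatives(conflict_handled, []),
}

for name, passed in results.items():
    print(f"{name}: {passed}")
```

If you cannot predict each printed value just by reading the fixture, the criterion is probably too fuzzy to be a binary check.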

3. Step 2 – Turn business priorities into weights

Not all criteria matter equally. In the scheduling example:

COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,   # wrong people or wrong time is catastrophic
    "correct_time": 0.25,
    "correct_duration": 0.10,       # a slightly wrong duration is annoying
    "explored_alternatives": 0.20,  # builds user trust
    "clear_explanation": 0.20,      # builds user trust
}

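Guidelines for designing weights

  • Start from business impact, not from how easy a check is to implement.
  • Make the weights sum to 1.0 so the score is easy to read.
  • Keep the number of criteria moderate (4–7).
  • Be willing to adjust the weights after looking at real data.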
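
Two of these guidelines (weights summing to 1.0, and 4–7 criteria) are mechanical enough to verify in code. The following is a minimal sketch; `validate_weights` and its parameters are hypothetical names, and the non-negativity check is an extra assumption not stated in the article.

```python
import math
from typing import Dict, List


def validate_weights(
    weights: Dict[str, float],
    min_criteria: int = 4,
    max_criteria: int = 7,
) -> List[str]:
    """Return a list of problems; an empty list means the weights look sane."""
    problems: List[str] = []

    if not min_criteria <= len(weights) <= max_criteria:
        problems.append(
            f"{len(weights)} criteria, expected {min_criteria} to {max_criteria}"
        )

    # Assumption: a negative weight is always a mistake.
    negative = [name for name, w in weights.items() if w < 0]
    if negative:
        problems.append(f"negative weights: {', '.join(negative)}")

    # Compare with a tolerance, because float sums are rarely exactly 1.0.
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        problems.append(f"weights sum to {total:.3f}, expected 1.0")

    return problems


COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,
    "correct_time": 0.25,
    "correct_duration": 0.10,
    "explored_alternatives": 0.20,
    "clear_explanation": 0.20,
}

print(validate_weights(COMPLETION_WEIGHTS))  # prints: []
```

Running a check like this at import time or in a unit test catches the common mistake of editing one weight and forgetting to rebalance the others.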

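4. Step 3 – Implement the per-request evaluator

Combine the boolean checks with the weights to compute a score for a single request.
A convenient way to represent the outcome is an EvaluationResult dataclass: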

from dataclasses import dataclass
from enum import Enum
from typing import Dict

class OutcomeType(Enum):
    SUCCESSFUL_COMPLETION = "successful_completion"
    GRACEFUL_FAILURE = "graceful_failure"
    PARTIAL_FAILURE = "partial_failure"
    HARD_FAILURE = "hard_failure"

@dataclass
class EvaluationResult:
    score: float                 # 0.0 to 1.0
    details: Dict[str, bool]     # criterion -> passed?
    outcome_type: OutcomeType
    explanation: str

Core evaluation function

def evaluate_task_completion(final_state, ground_truth, conversation_trace) -> EvaluationResult:
    scheduling_ctx = final_state.get("scheduling_context", {})
    conversation_stage = final_state.get("conversation_stage", "unknown")

    checks = {
        "correct_participants": _check_participants(scheduling_ctx, ground_truth),
        "correct_time": _check_time(scheduling_ctx, ground_truth),
        "correct_duration": _check_duration(scheduling_ctx, ground_truth),
        "explored_alternatives": _check_alternatives(scheduling_ctx, conversation_trace),
        "clear_explanation": _check_explanation(conversation_trace, conversation_stage),
    }

    score = sum(
        COMPLETION_WEIGHTS[k]
        for k, v in checks.items()
        if v
    )

    outcome = _classify_outcome(scheduling_ctx, conversation_stage, score)
    explanation = _generate_explanation(checks, outcome, score)

    return EvaluationResult(
        score=score,
        details=checks,
        outcome_type=outcome,
        explanation=explanation,
    )

The function returns:

  • A numeric score for analysis and thresholding.
  • A details dictionary for debugging.
  • A human-friendly explanation for reports or console output.
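
Since every request yields the same three pieces, results are easy to aggregate across a test set. Here is a rough sketch of what that could look like; `summarize` is a hypothetical helper, and the two sample results are made up only to show the shape of the report.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from statistics import mean
from typing import Dict, List


class OutcomeType(Enum):
    SUCCESSFUL_COMPLETION = "successful_completion"
    GRACEFUL_FAILURE = "graceful_failure"
    PARTIAL_FAILURE = "partial_failure"
    HARD_FAILURE = "hard_failure"


@dataclass
class EvaluationResult:
    score: float
    details: Dict[str, bool]
    outcome_type: OutcomeType
    explanation: str


def summarize(results: List[EvaluationResult]) -> dict:
    """Aggregate per-request results into a small report (hypothetical helper)."""
    if not results:
        raise ValueError("need at least one EvaluationResult")
    return {
        "mean_score": mean(r.score for r in results),
        "outcomes": Counter(r.outcome_type.value for r in results),
        # Count how often each criterion failed, to see where the agent is weakest.
        "failed_criteria": Counter(
            name for r in results for name, passed in r.details.items() if not passed
        ),
    }


# Two made-up results, only to show the shape of the report.
sample_results = [
    EvaluationResult(0.9, {"correct_time": True, "correct_duration": False},
                     OutcomeType.SUCCESSFUL_COMPLETION, "duration was wrong"),
    EvaluationResult(0.5, {"correct_time": False, "correct_duration": False},
                     OutcomeType.GRACEFUL_FAILURE, "no slot found"),
]

report = summarize(sample_results)
print(report["outcomes"].most_common())
print(report["failed_criteria"].most_common(1))
```

The failed-criteria counter is where the details dictionary pays off: it points at the specific check that drags scores down instead of leaving you with a single opaque number.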

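5. Step 4 – Map scores to outcome categories

(The implementations of _classify_outcome and _generate_explanation are domain-specific, but they usually follow the score thresholds described in Section 1.)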
