Binary Weighted Evaluation...How To
Published: (December 7, 2025, 15:44 GMT+8)
Source: Dev.to
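1. What is binary weighted evaluation?
At a high level:
- Define a set of binary criteria for a task. Each criterion is a question that can be answered with True or False.
Example criteria
correct_participants  # Did the agent book the right people?
clear_explanation     # Did the agent explain the outcome clearly?
- Assign each criterion a weight that reflects its importance. The weights usually sum to 1.0.
COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,
    "correct_time": 0.25,
    "correct_duration": 0.10,
    "explored_alternatives": 0.20,
    "clear_explanation": 0.20,
}
- For each task, compute a score between 0.0 and 1.0 by summing the weights of all criteria that are True.
score = sum(
    COMPLETION_WEIGHTS[k]
    for k, v in checks.items()
    if v
)
- Classify the result based on the score: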
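| Score range | Classification |
|---|---|
| score >= 0.75 and the booking is confirmed | Successful completion |
| score >= 0.50 | Graceful failure |
| (other ranges omitted in the original) | (unspecified) |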
# Expected structure of the checks dictionary: criterion name -> bool
from typing import Dict

checks: Dict[str, bool] = {
    "correct_participants": ...,   # bool
    "correct_time": ...,           # bool
    "correct_duration": ...,       # bool
    "explored_alternatives": ...,  # bool
    "clear_explanation": ...,      # bool
}
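As a worked example of the scoring rule above, the sketch below scores one hypothetical request in which every criterion passes except correct_duration. The check values are made up for illustration. Because the weights are floats, the raw sum can land slightly below the expected 0.9, so the sketch rounds before any threshold comparison; that rounding step is an assumption and does not come from the original article.

```python
from typing import Dict

COMPLETION_WEIGHTS: Dict[str, float] = {
    "correct_participants": 0.25,
    "correct_time": 0.25,
    "correct_duration": 0.10,
    "explored_alternatives": 0.20,
    "clear_explanation": 0.20,
}

# Hypothetical check results for a single request (illustrative values only).
checks: Dict[str, bool] = {
    "correct_participants": True,
    "correct_time": True,
    "correct_duration": False,
    "explored_alternatives": True,
    "clear_explanation": True,
}

# Sum the weights of every criterion that passed.
raw_score = sum(COMPLETION_WEIGHTS[k] for k, v in checks.items() if v)

# Rounding (an assumption, not from the article) keeps float noise from
# pushing a score just below a threshold such as 0.75.
score = round(raw_score, 6)

print(score)  # 0.9
```

Only the 0.10 weight for correct_duration is lost, so this request would clear the 0.75 threshold as long as the booking is confirmed.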
Example implementation
def _check_participants(scheduling_ctx, ground_truth) -> bool:
    # Nothing was booked, so the participants cannot be correct.
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    booked = set(scheduling_ctx["booked_event"]["participants"])
    expected = set(ground_truth["participants"])
    return booked == expected

def _check_time(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    return scheduling_ctx["booked_event"]["time"] == ground_truth["time"]

def _check_duration(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    # Fall back to 30 when the ground truth does not specify a duration.
    expected = ground_truth.get("duration", 30)
    return scheduling_ctx["booked_event"]["duration"] == expected

def _check_alternatives(scheduling_ctx, conversation_trace) -> bool:
    if not scheduling_ctx.get("conflicts"):
        # No conflict -> passes automatically
        return True
    proposed = scheduling_ctx.get("proposed_alternatives", [])
    return len(proposed) > 0

def _check_explanation(conversation_trace, conversation_stage: str) -> bool:
    if not conversation_trace:
        return False
    last_response = conversation_trace[-1].get("response", "")
    # A silent crash is bad: a failed run must still explain itself.
    if conversation_stage == "failed" and len(last_response) < 20:
        return False
    # The source is truncated at this point; the final return is an assumed
    # completion that requires a non-trivial last response.
    return len(last_response) > 20
Key rule: each check should be obviously True or False to anyone looking at the trace.
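To illustrate the rule, here is a minimal sketch that runs one check against three hand-written contexts. The field names (conflicts, proposed_alternatives) follow the example implementation; the function name check_alternatives and the sample values are hypothetical.

```python
from typing import Any, Dict

def check_alternatives(scheduling_ctx: Dict[str, Any]) -> bool:
    """Pass when there was no conflict, or when at least one alternative was proposed."""
    if not scheduling_ctx.get("conflicts"):
        return True
    return len(scheduling_ctx.get("proposed_alternatives", [])) > 0

# Hypothetical contexts, written by hand for illustration.
no_conflict = {"conflicts": []}
conflict_ignored = {"conflicts": ["14:00 overlaps another meeting"], "proposed_alternatives": []}
conflict_handled = {"conflicts": ["14:00 overlaps another meeting"], "proposed_alternatives": ["15:00"]}

results = [check_alternatives(c) for c in (no_conflict, conflict_ignored, conflict_handled)]
print(results)  # [True, False, True]
```

In each case a reviewer reading the context can predict the answer without running the code, which is the property the rule asks for.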
3. Step 2 – Turn business priorities into weights
Not all criteria are equally important. In the scheduling example:
COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,   # wrong people or wrong time is catastrophic
    "correct_time": 0.25,
    "correct_duration": 0.10,       # a slightly wrong duration is merely annoying
    "explored_alternatives": 0.20,  # builds user trust
    "clear_explanation": 0.20,      # builds user trust
}
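Guidelines for designing weights
- Start from business impact, not from how easy a check is to implement.
- Make the weights sum to 1.0 so that scores are easy to interpret.
- Keep the number of criteria moderate (4–7).
- Be willing to adjust the weights after looking at real data.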
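Two of these guidelines can be checked mechanically. The sketch below assumes a hypothetical helper, validate_weights, that is not part of the original article; it verifies that the weights sum to 1.0 within a floating-point tolerance and that the number of criteria falls in the suggested range. Treating the 4–7 range as a hard error is a design choice made here for illustration.

```python
import math
from typing import Dict

COMPLETION_WEIGHTS: Dict[str, float] = {
    "correct_participants": 0.25,
    "correct_time": 0.25,
    "correct_duration": 0.10,
    "explored_alternatives": 0.20,
    "clear_explanation": 0.20,
}

def validate_weights(weights: Dict[str, float],
                     min_criteria: int = 4,
                     max_criteria: int = 7) -> None:
    """Raise ValueError if the weights break the guidelines (hypothetical helper)."""
    if not (min_criteria <= len(weights) <= max_criteria):
        raise ValueError(
            f"expected {min_criteria}-{max_criteria} criteria, got {len(weights)}"
        )
    if any(w <= 0 for w in weights.values()):
        raise ValueError("every weight should be positive")
    total = sum(weights.values())
    # Compare with a tolerance: float sums are rarely exactly 1.0.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights sum to {total}, expected 1.0")

validate_weights(COMPLETION_WEIGHTS)  # passes silently
```

Running a check like this at import time or in a unit test catches a weight edit that forgets to rebalance the others.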
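4. Step 3 – Implement a per-request evaluator
Combine the boolean checks with the weights to compute a score for a single request.
A convenient way to represent the result is an EvaluationResult dataclass: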
from dataclasses import dataclass
from enum import Enum
from typing import Dict

class OutcomeType(Enum):
    SUCCESSFUL_COMPLETION = "successful_completion"
    GRACEFUL_FAILURE = "graceful_failure"
    PARTIAL_FAILURE = "partial_failure"
    HARD_FAILURE = "hard_failure"

@dataclass
class EvaluationResult:
    score: float               # 0.0 to 1.0
    details: Dict[str, bool]   # criterion -> passed?
    outcome_type: OutcomeType
    explanation: str
Core evaluation function
def evaluate_task_completion(final_state, ground_truth, conversation_trace) -> EvaluationResult:
    scheduling_ctx = final_state.get("scheduling_context", {})
    conversation_stage = final_state.get("conversation_stage", "unknown")
    checks = {
        "correct_participants": _check_participants(scheduling_ctx, ground_truth),
        "correct_time": _check_time(scheduling_ctx, ground_truth),
        "correct_duration": _check_duration(scheduling_ctx, ground_truth),
        "explored_alternatives": _check_alternatives(scheduling_ctx, conversation_trace),
        "clear_explanation": _check_explanation(conversation_trace, conversation_stage),
    }
    score = sum(
        COMPLETION_WEIGHTS[k]
        for k, v in checks.items()
        if v
    )
    outcome = _classify_outcome(scheduling_ctx, conversation_stage, score)
    explanation = _generate_explanation(checks, outcome, score)
    return EvaluationResult(
        score=score,
        details=checks,
        outcome_type=outcome,
        explanation=explanation,
    )
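The function returns:
- A numeric score for analysis and thresholding.
- A details dictionary for debugging.
- An outcome_type that classifies the result.
- A human-friendly explanation for reports or console output.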
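One way the details dictionaries can support debugging is to aggregate them across many requests. The sketch below assumes a hypothetical helper, pass_rates, which is not from the original article, and assumes every dictionary uses the same criterion names; it reports the fraction of requests that passed each criterion.

```python
from collections import Counter
from typing import Dict, List

def pass_rates(details_list: List[Dict[str, bool]]) -> Dict[str, float]:
    """Fraction of requests that passed each criterion (hypothetical helper)."""
    if not details_list:
        return {}
    counts: Counter = Counter()
    for details in details_list:
        for name, passed in details.items():
            # Adding 0 still registers the criterion, so never-passed
            # criteria show up with a rate of 0.0.
            counts[name] += int(passed)
    return {name: counts[name] / len(details_list) for name in counts}

# Hypothetical details from two evaluated requests.
runs = [
    {"correct_time": True, "clear_explanation": True},
    {"correct_time": False, "clear_explanation": True},
]
print(pass_rates(runs))  # {'correct_time': 0.5, 'clear_explanation': 1.0}
```

A criterion with a low pass rate points at a specific behavior to fix, which an average score alone would hide.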
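5. Step 4 – Map scores to outcome categories
(The implementations of _classify_outcome and _generate_explanation are domain-specific, but they generally follow the score thresholds described in section 1.)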
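A minimal sketch of such a classifier might look like the following. The first two branches come from the thresholds in section 1. The article does not specify the remaining ranges, so the 0.25 cut-off between partial and hard failure (PARTIAL_FAILURE_FLOOR) is purely an assumption for illustration, as is the function name classify_outcome_sketch.

```python
from enum import Enum
from typing import Any, Dict

class OutcomeType(Enum):
    SUCCESSFUL_COMPLETION = "successful_completion"
    GRACEFUL_FAILURE = "graceful_failure"
    PARTIAL_FAILURE = "partial_failure"
    HARD_FAILURE = "hard_failure"

# ASSUMPTION: not given in the article; chosen only to make the sketch complete.
PARTIAL_FAILURE_FLOOR = 0.25

def classify_outcome_sketch(scheduling_ctx: Dict[str, Any],
                            conversation_stage: str,
                            score: float) -> OutcomeType:
    """Map a score to an outcome category (illustrative, not the author's code)."""
    # conversation_stage is kept to mirror the article's call signature;
    # this sketch does not use it.
    booking_confirmed = bool(scheduling_ctx.get("booking_confirmed"))
    if score >= 0.75 and booking_confirmed:
        return OutcomeType.SUCCESSFUL_COMPLETION
    if score >= 0.50:
        return OutcomeType.GRACEFUL_FAILURE
    if score >= PARTIAL_FAILURE_FLOOR:
        return OutcomeType.PARTIAL_FAILURE
    return OutcomeType.HARD_FAILURE

print(classify_outcome_sketch({"booking_confirmed": True}, "completed", 0.9).value)
# successful_completion
```

Note that a high score without a confirmed booking falls through to graceful failure, which matches the table in section 1.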