Binary weighted evaluations...how to
Source: Dev.to
1. What is a binary weighted evaluation?
At a high level:
- Define a set of binary criteria for a task. Each criterion is a question that can be answered with True or False.
Example criteria
correct_participants # Did the agent book the right people?
clear_explanation # Did the agent explain the outcome clearly?
- Assign each criterion a weight that reflects its importance. All weights typically sum to 1.0.
COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,
    "correct_time": 0.25,
    "correct_duration": 0.10,
    "explored_alternatives": 0.20,
    "clear_explanation": 0.20,
}
- For each task, compute a score from 0.0 to 1.0 by summing the weights of all criteria that are True.
score = sum(
    COMPLETION_WEIGHTS[k]
    for k, v in checks.items()
    if v
)
- Classify the outcome based on the score:
| Score range | Classification |
|---|---|
| score >= 0.75 and booking confirmed | Successful completion |
| score >= 0.50 | Graceful failure |
| lower scores | Partial or hard failure |
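For example, with these weights, a run where correct_participants, correct_time, and clear_explanation pass but correct_duration and explored_alternatives fail scores 0.25 + 0.25 + 0.20 = 0.70: above the 0.50 bar for a graceful failure, but short of the 0.75 needed for a successful completion.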
2. Step 1 – Define the binary checks
Each criterion becomes a small check function that inspects the final state and the conversation trace and returns a plain bool; the evaluator then collects the results into a dictionary.
# Expected shape of the checks dictionary
checks: Dict[str, bool] = {
    "correct_participants": ...,   # bool from _check_participants
    "correct_time": ...,           # bool from _check_time
    "correct_duration": ...,       # bool from _check_duration
    "explored_alternatives": ...,  # bool from _check_alternatives
    "clear_explanation": ...,      # bool from _check_explanation
}
Example implementations
def _check_participants(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    booked = set(scheduling_ctx["booked_event"]["participants"])
    expected = set(ground_truth["participants"])
    return booked == expected

def _check_time(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    return scheduling_ctx["booked_event"]["time"] == ground_truth["time"]

def _check_duration(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    expected = ground_truth.get("duration", 30)
    return scheduling_ctx["booked_event"]["duration"] == expected

def _check_alternatives(scheduling_ctx, conversation_trace) -> bool:
    if not scheduling_ctx.get("conflicts"):
        # No conflict → automatically ok
        return True
    proposed = scheduling_ctx.get("proposed_alternatives", [])
    return len(proposed) > 0
def _check_explanation(conversation_trace, conversation_stage: str) -> bool:
    if not conversation_trace:
        return False
    last_response = conversation_trace[-1].get("response", "")
    # Silent crash is bad: a failed run must still end with a real explanation
    # (the 20-character threshold is a rough heuristic)
    if conversation_stage == "failed" and len(last_response) < 20:
        return False
    return True
Key rule: each check should be obviously True or False when you look at the trace.
3. Step 2 – Turn business priorities into weights
Not all criteria are equally important. In the scheduling example:
COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,   # Wrong person or time is catastrophic
    "correct_time": 0.25,
    "correct_duration": 0.10,       # Slightly wrong duration is annoying
    "explored_alternatives": 0.20,  # Builds user trust
    "clear_explanation": 0.20,      # Builds user trust
}
Guidelines for designing weights
- Start from business impact, not from ease of checking.
- Make weights sum to 1.0 so the score is intuitive (a quick sanity check is sketched after this list).
- Keep the number of criteria modest (4–7).
- Be willing to adjust weights after observing real data.
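A one-line guard keeps the "sum to 1.0" guideline honest; this is just a sketch that assumes COMPLETION_WEIGHTS lives at module level:
# Fail fast (at import time) if the weights no longer sum to 1.0 after an edit
_total = sum(COMPLETION_WEIGHTS.values())
assert abs(_total - 1.0) < 1e-9, f"COMPLETION_WEIGHTS sums to {_total}, expected 1.0"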
4. Step 3 – Implement the per‑request evaluator
Combine the boolean checks and weights to compute a score for a single request.
A convenient representation is an EvaluationResult dataclass:
from dataclasses import dataclass
from enum import Enum
from typing import Dict

class OutcomeType(Enum):
    SUCCESSFUL_COMPLETION = "successful_completion"
    GRACEFUL_FAILURE = "graceful_failure"
    PARTIAL_FAILURE = "partial_failure"
    HARD_FAILURE = "hard_failure"

@dataclass
class EvaluationResult:
    score: float              # 0.0 to 1.0
    details: Dict[str, bool]  # criterion -> passed?
    outcome_type: OutcomeType
    explanation: str
Core evaluation function
def evaluate_task_completion(final_state, ground_truth, conversation_trace) -> EvaluationResult:
    scheduling_ctx = final_state.get("scheduling_context", {})
    conversation_stage = final_state.get("conversation_stage", "unknown")

    checks = {
        "correct_participants": _check_participants(scheduling_ctx, ground_truth),
        "correct_time": _check_time(scheduling_ctx, ground_truth),
        "correct_duration": _check_duration(scheduling_ctx, ground_truth),
        "explored_alternatives": _check_alternatives(scheduling_ctx, conversation_trace),
        "clear_explanation": _check_explanation(conversation_trace, conversation_stage),
    }

    score = sum(
        COMPLETION_WEIGHTS[k]
        for k, v in checks.items()
        if v
    )

    outcome = _classify_outcome(scheduling_ctx, conversation_stage, score)
    explanation = _generate_explanation(checks, outcome, score)

    return EvaluationResult(
        score=score,
        details=checks,
        outcome_type=outcome,
        explanation=explanation,
    )
The function returns:
- A numeric score for analytics and thresholds.
- A details dict for debugging.
- A human‑friendly explanation for reports or console output (a usage sketch follows this list).
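Calling it end to end might look like the sketch below. The shapes of final_state, ground_truth, and conversation_trace are only inferred from the check functions above, so treat the field names and values as illustrative:
# Hypothetical inputs, shaped to match what the check functions expect
final_state = {
    "scheduling_context": {
        "booking_confirmed": True,
        "booked_event": {
            "participants": ["alice@example.com", "bob@example.com"],
            "time": "2024-06-03T10:00",
            "duration": 30,
        },
        "conflicts": [],
    },
    "conversation_stage": "completed",
}
ground_truth = {
    "participants": ["alice@example.com", "bob@example.com"],
    "time": "2024-06-03T10:00",
    "duration": 30,
}
conversation_trace = [
    {"response": "I booked a 30-minute meeting with Alice and Bob for Monday at 10:00."}
]

result = evaluate_task_completion(final_state, ground_truth, conversation_trace)
print(result.score)         # 1.0 here, since every check passes
print(result.outcome_type)  # OutcomeType.SUCCESSFUL_COMPLETION
print(result.explanation)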
5. Step 4 – Map scores to outcome classes
The implementation of _classify_outcome and _generate_explanation is domain‑specific, but it typically follows the score thresholds described in Section 1.
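A minimal sketch, assuming the thresholds from Section 1; the 0.25 cut-off between partial and hard failure is an illustrative value, since only the top two ranges are specified above:
def _classify_outcome(scheduling_ctx, conversation_stage, score) -> OutcomeType:
    # conversation_stage is available for domain-specific rules (unused in this sketch)
    if score >= 0.75 and scheduling_ctx.get("booking_confirmed"):
        return OutcomeType.SUCCESSFUL_COMPLETION
    if score >= 0.50:
        return OutcomeType.GRACEFUL_FAILURE
    if score >= 0.25:  # assumed split between partial and hard failure
        return OutcomeType.PARTIAL_FAILURE
    return OutcomeType.HARD_FAILURE

def _generate_explanation(checks, outcome, score) -> str:
    # Summarize the verdict plus which criteria passed and failed
    passed = [k for k, ok in checks.items() if ok]
    failed = [k for k, ok in checks.items() if not ok]
    return (
        f"{outcome.value} (score {score:.2f}); "
        f"passed: {', '.join(passed) or 'none'}; "
        f"failed: {', '.join(failed) or 'none'}"
    )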