Binary weighted evaluations: a how-to

Published: December 7, 2025 at 02:44 AM EST
3 min read
Source: Dev.to

1. What is a binary weighted evaluation?

At a high level:

  • Define a set of binary criteria for a task. Each criterion is a question that can be answered with True or False.

Example criteria

correct_participants   # Did the agent book the right people?
clear_explanation      # Did the agent explain the outcome clearly?
  • Assign each criterion a weight that reflects its importance. All weights typically sum to 1.0.
COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,
    "correct_time": 0.25,
    "correct_duration": 0.10,
    "explored_alternatives": 0.20,
    "clear_explanation": 0.20,
}
  • For each task, compute a score from 0.0 to 1.0 by summing the weights of all criteria that are True.
score = sum(
    COMPLETION_WEIGHTS[k]
    for k, v in checks.items()
    if v
)
  • Classify the outcome based on the score:
Score range                                Classification
score >= 0.75 and booking confirmed        Successful completion
score >= 0.50                              Graceful failure
(other ranges omitted in original text)    (not specified)
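
A quick worked example with hypothetical check results: if only correct_participants, correct_time, and clear_explanation pass, the score is 0.25 + 0.25 + 0.20 = 0.70, which lands in the graceful-failure band.

checks = {
    "correct_participants": True,
    "correct_time": True,
    "correct_duration": False,
    "explored_alternatives": False,
    "clear_explanation": True,
}
score = sum(COMPLETION_WEIGHTS[k] for k, v in checks.items() if v)
print(score)  # 0.7 -> >= 0.50 but < 0.75: graceful failure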
2. Step 1 – Define the binary checks

Each criterion becomes a small check function that inspects the final state or the conversation trace and returns a plain bool:

# Expected shape of the checks dictionary: criterion name -> did it pass?
checks: Dict[str, bool] = {
    "correct_participants": ...,   # bool: right people booked?
    "correct_time": ...,           # bool: right time?
    "correct_duration": ...,       # bool: right duration?
    "explored_alternatives": ...,  # bool: alternatives offered on conflict?
    "clear_explanation": ...,      # bool: outcome explained clearly?
}

Example implementations

def _check_participants(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    booked = set(scheduling_ctx["booked_event"]["participants"])
    expected = set(ground_truth["participants"])
    return booked == expected

def _check_time(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    return scheduling_ctx["booked_event"]["time"] == ground_truth["time"]

def _check_duration(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    expected = ground_truth.get("duration", 30)
    return scheduling_ctx["booked_event"]["duration"] == expected

def _check_alternatives(scheduling_ctx, conversation_trace) -> bool:
    if not scheduling_ctx.get("conflicts"):
        # No conflict → automatically ok
        return True
    proposed = scheduling_ctx.get("proposed_alternatives", [])
    return len(proposed) > 0

def _check_explanation(conversation_trace, conversation_stage: str) -> bool:
    if not conversation_trace:
        return False
    last_response = conversation_trace[-1].get("response", "")
    # Silent crash is bad: a failed run must still explain what happened
    if conversation_stage == "failed" and len(last_response) < 20:
        return False
    # Require a non-trivial explanation in any case
    return len(last_response) > 20

Key rule: each check should be obviously True or False when you look at the trace.
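
A quick contrast (illustrative, with hypothetical helper names): the first check below reads a fact straight off the trace; the second needs a judgment call and does not belong in a binary rubric.

# Crisp: answerable from the trace alone
def _check_confirmation_sent(conversation_trace) -> bool:
    return any(
        "confirmation" in turn.get("response", "").lower()
        for turn in conversation_trace
    )

# Fuzzy: "friendly" is not a trace fact; avoid checks like this
def _check_response_was_friendly(conversation_trace) -> bool:
    ...  # no objective True/False here without a human or LLM judge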

3. Step 2 – Turn business priorities into weights

Not all criteria are equally important. In the scheduling example:

COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,   # Wrong person or time is catastrophic
    "correct_time": 0.25,
    "correct_duration": 0.10,       # Slightly wrong duration is annoying
    "explored_alternatives": 0.20,  # Builds user trust
    "clear_explanation": 0.20,      # Builds user trust
}

Guidelines for designing weights

  • Start from business impact, not from ease of checking.
  • Make weights sum to 1.0 so the score is intuitive (see the sanity check after this list).
  • Keep the number of criteria modest (4–7).
  • Be willing to adjust weights after observing real data.
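
A small sanity check keeps the sum-to-1.0 guideline honest as weights get adjusted over time (a minimal sketch; math.isclose sidesteps float noise):

import math

# Scores are only interpretable as "fraction of full credit"
# if the weights sum to 1.0.
assert math.isclose(sum(COMPLETION_WEIGHTS.values()), 1.0), \
    f"weights sum to {sum(COMPLETION_WEIGHTS.values()):.2f}, expected 1.0"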

4. Step 3 – Implement the per‑request evaluator

Combine the boolean checks and weights to compute a score for a single request.
A convenient representation is an EvaluationResult dataclass:

from dataclasses import dataclass
from enum import Enum
from typing import Dict

class OutcomeType(Enum):
    SUCCESSFUL_COMPLETION = "successful_completion"
    GRACEFUL_FAILURE = "graceful_failure"
    PARTIAL_FAILURE = "partial_failure"
    HARD_FAILURE = "hard_failure"

@dataclass
class EvaluationResult:
    score: float                 # 0.0 to 1.0
    details: Dict[str, bool]     # criterion -> passed?
    outcome_type: OutcomeType
    explanation: str

Core evaluation function

def evaluate_task_completion(final_state, ground_truth, conversation_trace) -> EvaluationResult:
    scheduling_ctx = final_state.get("scheduling_context", {})
    conversation_stage = final_state.get("conversation_stage", "unknown")

    checks = {
        "correct_participants": _check_participants(scheduling_ctx, ground_truth),
        "correct_time": _check_time(scheduling_ctx, ground_truth),
        "correct_duration": _check_duration(scheduling_ctx, ground_truth),
        "explored_alternatives": _check_alternatives(scheduling_ctx, conversation_trace),
        "clear_explanation": _check_explanation(conversation_trace, conversation_stage),
    }

    score = sum(
        COMPLETION_WEIGHTS[k]
        for k, v in checks.items()
        if v
    )

    outcome = _classify_outcome(scheduling_ctx, conversation_stage, score)
    explanation = _generate_explanation(checks, outcome, score)

    return EvaluationResult(
        score=score,
        details=checks,
        outcome_type=outcome,
        explanation=explanation,
    )

The function returns:

  • A numeric score for analytics and thresholds.
  • A details dict for debugging.
  • A human‑friendly explanation for reports or console output.
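
For example, a run against a hypothetical happy-path trace (field names follow the checks above; the concrete values are invented for illustration):

final_state = {
    "scheduling_context": {
        "booking_confirmed": True,
        "booked_event": {
            "participants": ["alice@example.com", "bob@example.com"],
            "time": "2025-12-10T10:00",
            "duration": 30,
        },
        "conflicts": [],
    },
    "conversation_stage": "completed",
}
ground_truth = {
    "participants": ["alice@example.com", "bob@example.com"],
    "time": "2025-12-10T10:00",
    "duration": 30,
}
conversation_trace = [
    {"response": "Booked a 30-minute meeting with Alice and Bob at 10:00 on Dec 10."}
]

result = evaluate_task_completion(final_state, ground_truth, conversation_trace)
print(result.score)         # 1.0 – every check passes
print(result.outcome_type)  # OutcomeType.SUCCESSFUL_COMPLETION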

5. Step 4 – Map scores to outcome classes

(Implementation of _classify_outcome and _generate_explanation is domain‑specific, but typically follows the score thresholds described in Section 1.)
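
A minimal sketch, assuming the thresholds from Section 1 (the 0.25 boundary between partial and hard failure is an assumption, since the original table leaves the lower ranges unspecified):

def _classify_outcome(scheduling_ctx, conversation_stage, score) -> OutcomeType:
    # 0.75 and 0.50 come from the Section 1 table; 0.25 is assumed
    if score >= 0.75 and scheduling_ctx.get("booking_confirmed"):
        return OutcomeType.SUCCESSFUL_COMPLETION
    if score >= 0.50:
        return OutcomeType.GRACEFUL_FAILURE
    if score >= 0.25:
        return OutcomeType.PARTIAL_FAILURE
    return OutcomeType.HARD_FAILURE

def _generate_explanation(checks, outcome, score) -> str:
    # One line for reports: outcome, score, and any failed criteria
    failed = [name for name, passed in checks.items() if not passed]
    line = f"{outcome.value}: score {score:.2f}"
    if failed:
        line += " (failed: " + ", ".join(failed) + ")"
    return line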
