Binary weighted evaluations...how to
Source: Dev.to
1. What is a binary weighted evaluation?
At a high level:
- Define a set of binary criteria for a task. Each criterion is a question that can be answered with True or False.
Example criteria
correct_participants # Did the agent book the right people?
clear_explanation # Did the agent explain the outcome clearly?
- Assign each criterion a weight that reflects its importance. All weights typically sum to 1.0.
COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,
    "correct_time": 0.25,
    "correct_duration": 0.10,
    "explored_alternatives": 0.20,
    "clear_explanation": 0.20,
}
- For each task, compute a score from 0.0 to 1.0 by summing the weights of all criteria that are True.
score = sum(
    COMPLETION_WEIGHTS[k]
    for k, v in checks.items()
    if v
)
- Classify the outcome based on the score:
| Score range | Classification |
|---|---|
| score >= 0.75 and booking confirmed | Successful completion |
| score >= 0.50 | Graceful failure |
| lower scores | Partial or hard failure |
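For example, with these weights, a run where correct_participants, correct_time, and clear_explanation pass but correct_duration and explored_alternatives fail scores 0.25 + 0.25 + 0.20 = 0.70: above the 0.50 bar for a graceful failure, but short of the 0.75 needed for a successful completion.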
2. Step 1 – Define the binary checks
Each criterion becomes a small check function that inspects the final state and the conversation trace and returns a plain bool; the evaluator then collects the results into a dictionary.
# Expected shape of the checks dictionary
checks: Dict[str, bool] = {
    "correct_participants": ...,   # bool from _check_participants
    "correct_time": ...,           # bool from _check_time
    "correct_duration": ...,       # bool from _check_duration
    "explored_alternatives": ...,  # bool from _check_alternatives
    "clear_explanation": ...,      # bool from _check_explanation
}
Example implementations
def _check_participants(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    booked = set(scheduling_ctx["booked_event"]["participants"])
    expected = set(ground_truth["participants"])
    return booked == expected

def _check_time(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    return scheduling_ctx["booked_event"]["time"] == ground_truth["time"]

def _check_duration(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    expected = ground_truth.get("duration", 30)
    return scheduling_ctx["booked_event"]["duration"] == expected

def _check_alternatives(scheduling_ctx, conversation_trace) -> bool:
    if not scheduling_ctx.get("conflicts"):
        # No conflict → automatically ok
        return True
    proposed = scheduling_ctx.get("proposed_alternatives", [])
    return len(proposed) > 0
def _check_explanation(conversation_trace, conversation_stage: str) -> bool:
    if not conversation_trace:
        return False
    last_response = conversation_trace[-1].get("response", "")
    # Silent crash is bad: a failed run must still end with a real explanation
    # (the 20-character threshold is a rough heuristic)
    if conversation_stage == "failed" and len(last_response) < 20:
        return False
    return True
Key rule: each check should be obviously True or False when you look at the trace.
3. Step 2 – Turn business priorities into weights
Not all criteria are equally important. In the scheduling example:
COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,   # Wrong person or time is catastrophic
    "correct_time": 0.25,
    "correct_duration": 0.10,       # Slightly wrong duration is annoying
    "explored_alternatives": 0.20,  # Builds user trust
    "clear_explanation": 0.20,      # Builds user trust
}
Guidelines for designing weights
- Start from business impact, not from ease of checking.
- Make weights sum to 1.0 so the score is intuitive (a quick sanity check is sketched after this list).
- Keep the number of criteria modest (4–7).
- Be willing to adjust weights after observing real data.
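A one-line guard keeps the "sum to 1.0" guideline honest; this is just a sketch that assumes COMPLETION_WEIGHTS lives at module level:
# Fail fast (at import time) if the weights no longer sum to 1.0 after an edit
_total = sum(COMPLETION_WEIGHTS.values())
assert abs(_total - 1.0) < 1e-9, f"COMPLETION_WEIGHTS sums to {_total}, expected 1.0"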
4. Step 3 – Implement the per‑request evaluator
Combine the boolean checks and weights to compute a score for a single request.
A convenient representation is an EvaluationResult dataclass:
from dataclasses import dataclass
from enum import Enum
from typing import Dict

class OutcomeType(Enum):
    SUCCESSFUL_COMPLETION = "successful_completion"
    GRACEFUL_FAILURE = "graceful_failure"
    PARTIAL_FAILURE = "partial_failure"
    HARD_FAILURE = "hard_failure"

@dataclass
class EvaluationResult:
    score: float              # 0.0 to 1.0
    details: Dict[str, bool]  # criterion -> passed?
    outcome_type: OutcomeType
    explanation: str
Core evaluation function
def evaluate_task_completion(final_state, ground_truth, conversation_trace) -> EvaluationResult:
    scheduling_ctx = final_state.get("scheduling_context", {})
    conversation_stage = final_state.get("conversation_stage", "unknown")

    checks = {
        "correct_participants": _check_participants(scheduling_ctx, ground_truth),
        "correct_time": _check_time(scheduling_ctx, ground_truth),
        "correct_duration": _check_duration(scheduling_ctx, ground_truth),
        "explored_alternatives": _check_alternatives(scheduling_ctx, conversation_trace),
        "clear_explanation": _check_explanation(conversation_trace, conversation_stage),
    }

    score = sum(
        COMPLETION_WEIGHTS[k]
        for k, v in checks.items()
        if v
    )

    outcome = _classify_outcome(scheduling_ctx, conversation_stage, score)
    explanation = _generate_explanation(checks, outcome, score)

    return EvaluationResult(
        score=score,
        details=checks,
        outcome_type=outcome,
        explanation=explanation,
    )
The function returns:
- A numeric score for analytics and thresholds.
- A details dict for debugging.
- A human‑friendly explanation for reports or console output (a usage sketch follows this list).
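Calling it end to end might look like the sketch below. The shapes of final_state, ground_truth, and conversation_trace are only inferred from the check functions above, so treat the field names and values as illustrative:
# Hypothetical inputs, shaped to match what the check functions expect
final_state = {
    "scheduling_context": {
        "booking_confirmed": True,
        "booked_event": {
            "participants": ["alice@example.com", "bob@example.com"],
            "time": "2024-06-03T10:00",
            "duration": 30,
        },
        "conflicts": [],
    },
    "conversation_stage": "completed",
}
ground_truth = {
    "participants": ["alice@example.com", "bob@example.com"],
    "time": "2024-06-03T10:00",
    "duration": 30,
}
conversation_trace = [
    {"response": "I booked a 30-minute meeting with Alice and Bob for Monday at 10:00."}
]

result = evaluate_task_completion(final_state, ground_truth, conversation_trace)
print(result.score)         # 1.0 here, since every check passes
print(result.outcome_type)  # OutcomeType.SUCCESSFUL_COMPLETION
print(result.explanation)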
5. Step 4 – Map scores to outcome classes
The implementation of _classify_outcome and _generate_explanation is domain‑specific, but it typically follows the score thresholds described in Section 1.
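A minimal sketch, assuming the thresholds from Section 1; the 0.25 cut-off between partial and hard failure is an illustrative value, since only the top two ranges are specified above:
def _classify_outcome(scheduling_ctx, conversation_stage, score) -> OutcomeType:
    # conversation_stage is available for domain-specific rules (unused in this sketch)
    if score >= 0.75 and scheduling_ctx.get("booking_confirmed"):
        return OutcomeType.SUCCESSFUL_COMPLETION
    if score >= 0.50:
        return OutcomeType.GRACEFUL_FAILURE
    if score >= 0.25:  # assumed split between partial and hard failure
        return OutcomeType.PARTIAL_FAILURE
    return OutcomeType.HARD_FAILURE

def _generate_explanation(checks, outcome, score) -> str:
    # Summarize the verdict plus which criteria passed and failed
    passed = [k for k, ok in checks.items() if ok]
    failed = [k for k, ok in checks.items() if not ok]
    return (
        f"{outcome.value} (score {score:.2f}); "
        f"passed: {', '.join(passed) or 'none'}; "
        f"failed: {', '.join(failed) or 'none'}"
    )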