Binary Weighted Evaluation...How To

Published: 2 days ago (December 7, 2025, 4:44 PM GMT+9)

5 min read

Source: Dev.to

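1. What is binary weighted evaluation?

At a high level:

Define a set of binary criteria for the task. Each criterion is a question that can be answered True or False.

Example criteria

correct_participants   # Did the agent book the right people?
clear_explanation      # Did the agent explain the outcome clearly?

Assign each criterion a weight that reflects its importance. The weights typically sum to 1.0.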

COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,
    "correct_time": 0.25,
    "correct_duration": 0.10,
    "explored_alternatives": 0.20,
    "clear_explanation": 0.20,
}

For each task, sum the weights of all criteria that are True to get a score between 0.0 and 1.0.

score = sum(
    COMPLETION_WEIGHTS[k]
    for k, v in checks.items()
    if v
)
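
As a worked example, consider a hypothetical run in which the agent booked the right people at the right time and explored alternatives, but got the duration wrong and did not explain the outcome clearly. The check values below are invented for illustration; the weights are the ones listed above.

```python
COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,
    "correct_time": 0.25,
    "correct_duration": 0.10,
    "explored_alternatives": 0.20,
    "clear_explanation": 0.20,
}

# Hypothetical check results for a single run (not real data).
checks = {
    "correct_participants": True,
    "correct_time": True,
    "correct_duration": False,
    "explored_alternatives": True,
    "clear_explanation": False,
}

# Only the weights of passing criteria contribute: 0.25 + 0.25 + 0.20.
score = sum(COMPLETION_WEIGHTS[k] for k, v in checks.items() if v)

print(f"{score:.2f}")  # 0.70
```

A score of 0.70 falls short of the 0.75 threshold, so this run would not count as a successful completion even if the booking was confirmed.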

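Classify the outcome based on the score:

Score range	Classification
`score >= 0.75` and booking confirmed	Successful completion
`score >= 0.50`	Graceful failure
(other ranges omitted in the original)	(not specified)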

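from typing import Dict

# Expected shape of the checks dictionary: every criterion name maps to a bool.
# The False values are placeholders only; real values come from the check functions.
checks: Dict[str, bool] = {
    "correct_participants": False,
    "correct_time": False,
    "correct_duration": False,
    "explored_alternatives": False,
    "clear_explanation": False,
}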

Implementation example

def _check_participants(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    booked = set(scheduling_ctx["booked_event"]["participants"])
    expected = set(ground_truth["participants"])
    return booked == expected

def _check_time(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    return scheduling_ctx["booked_event"]["time"] == ground_truth["time"]

def _check_duration(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    expected = ground_truth.get("duration", 30)
    return scheduling_ctx["booked_event"]["duration"] == expected

def _check_alternatives(scheduling_ctx, conversation_trace) -> bool:
    if not scheduling_ctx.get("conflicts"):
        # No conflicts → automatically OK
        return True
    proposed = scheduling_ctx.get("proposed_alternatives", [])
    return len(proposed) > 0

def _check_explanation(conversation_trace, conversation_stage: str) -> bool:
    if not conversation_trace:
        return False
    last_response = conversation_trace[-1].get("response", "")
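    # Failing silently is bad
    # NOTE: this line was truncated in the source. The minimum length for a
    # failed stage (50) is an assumed value; only the final "> 20" survives.
    if conversation_stage == "failed" and len(last_response) < 50:
        return False
    return len(last_response) > 20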

Key rule: every check must resolve unambiguously to True or False from the trace alone.
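
To illustrate the rule, here is a small sketch that runs the alternatives check against three hypothetical scheduling contexts. The function body repeats the logic shown above so the snippet runs on its own; the context values are invented for illustration.

```python
def _check_alternatives(scheduling_ctx, conversation_trace) -> bool:
    # Same logic as the check above, repeated so this snippet is self-contained.
    if not scheduling_ctx.get("conflicts"):
        return True
    return len(scheduling_ctx.get("proposed_alternatives", [])) > 0

# Hypothetical contexts (assumed field values, not real traces).
no_conflict = {"conflicts": []}
conflict_ignored = {"conflicts": ["10:00 overlaps with standup"]}
conflict_handled = {
    "conflicts": ["10:00 overlaps with standup"],
    "proposed_alternatives": ["11:00", "14:30"],
}

results = [
    _check_alternatives(ctx, [])
    for ctx in (no_conflict, conflict_ignored, conflict_handled)
]
print(results)  # [True, False, True]
```

Each context yields exactly one answer, with no partial credit and no judgment call, which is what keeps the weighted sum easy to interpret.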

3. Step 2 – Turn business priorities into weights

Not all criteria matter equally. In the scheduling example, the weights are set as follows:

COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,   # wrong people or wrong time is fatal
    "correct_time": 0.25,
    "correct_duration": 0.10,       # a slightly wrong duration is an inconvenience
    "explored_alternatives": 0.20,   # builds user trust
    "clear_explanation": 0.20,       # builds user trust
}

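Weight design guidelines

Start from business impact, not from how easy a criterion is to check.
Make the weights sum to 1.0 so the score stays intuitive.
Keep the number of criteria manageable (4 to 7).
Be prepared to adjust the weights after observing real data.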
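
Two of these guidelines (the 1.0 sum and the 4 to 7 criteria range) are mechanical enough to check in code. The sketch below is one possible way to do that; `validate_weights` is a hypothetical helper introduced here for illustration, not part of the original implementation.

```python
import math

COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,
    "correct_time": 0.25,
    "correct_duration": 0.10,
    "explored_alternatives": 0.20,
    "clear_explanation": 0.20,
}

def validate_weights(weights, min_criteria=4, max_criteria=7):
    """Return a list of guideline violations (empty if the table looks fine)."""
    problems = []
    total = sum(weights.values())
    # Compare with a tolerance: float addition rarely lands on exactly 1.0.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        problems.append(f"weights sum to {total:.2f}, expected 1.0")
    if not min_criteria <= len(weights) <= max_criteria:
        problems.append(
            f"{len(weights)} criteria, expected {min_criteria} to {max_criteria}"
        )
    return problems

print(validate_weights(COMPLETION_WEIGHTS))  # []
```

Running a check like this whenever the weights change may help catch a table that silently drifts away from 1.0 after a manual adjustment.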

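4. Step 3 – Implement the per-request evaluator

Combine the boolean checks with the weights to compute a score for a single request.
An EvaluationResult dataclass is a convenient way to represent the result: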

from dataclasses import dataclass
from enum import Enum
from typing import Dict

class OutcomeType(Enum):
    SUCCESSFUL_COMPLETION = "successful_completion"
    GRACEFUL_FAILURE = "graceful_failure"
    PARTIAL_FAILURE = "partial_failure"
    HARD_FAILURE = "hard_failure"

@dataclass
class EvaluationResult:
    score: float                 # 0.0 to 1.0
    details: Dict[str, bool]     # criterion -> passed?
    outcome_type: OutcomeType
    explanation: str

Core evaluation function

def evaluate_task_completion(final_state, ground_truth, conversation_trace) -> EvaluationResult:
    scheduling_ctx = final_state.get("scheduling_context", {})
    conversation_stage = final_state.get("conversation_stage", "unknown")

    checks = {
        "correct_participants": _check_participants(scheduling_ctx, ground_truth),
        "correct_time": _check_time(scheduling_ctx, ground_truth),
        "correct_duration": _check_duration(scheduling_ctx, ground_truth),
        "explored_alternatives": _check_alternatives(scheduling_ctx, conversation_trace),
        "clear_explanation": _check_explanation(conversation_trace, conversation_stage),
    }

    score = sum(
        COMPLETION_WEIGHTS[k]
        for k, v in checks.items()
        if v
    )

    outcome = _classify_outcome(scheduling_ctx, conversation_stage, score)
    explanation = _generate_explanation(checks, outcome, score)

    return EvaluationResult(
        score=score,
        details=checks,
        outcome_type=outcome,
        explanation=explanation,
    )

This function returns:

A numeric score for analysis and thresholding.
A details dictionary for debugging.
A human-friendly explanation for reports or console output.

5. Step 4 – Map scores to outcome classes

(The implementations of _classify_outcome and _generate_explanation depend on the domain, but they generally follow the score thresholds described in section 1.)
