Why Your AI Ignores Safety Constraints (and How We Fixed It by Engineering 'Intent')

Published: 3 days ago (February 25, 2026, 5:31 PM GMT+9)

5 min read

Source: Dev.to

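If you have ever prompted an LLM, you have probably run into this frustrating situation: you tell the AI to prioritize “safety, clarity, and conciseness,” and then the model has to choose between making a sentence clearer and making it safer. A typical prompt treats these goals as equal priorities, so the outcome is effectively a coin flip.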

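Today we hand goals to LLMs as a flat, comma-separated list. The AI reads “safety” and “conciseness” as equal priorities, and there is no built-in mechanism to tell the model that a medical safety constraint far outranks a witty sentence. This gap between what you mean and what the model hears is a major obstacle to building reliable AI.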

We recently solved this problem by building a system called Intent Engineering that relies on Value Hierarchies. Below we walk through how the system works, why it matters, and how to give your AI a machine-readable “conscience.”

The Problem: AI Goals Are Unordered

Goals carry no ranking. For example:

optimize(goals="clarity, safety")

This call treats both goals identically.

The Data Structure

from enum import Enum
from typing import List, Optional
from pydantic import BaseModel

class PriorityLabel(str, Enum):
    NON_NEGOTIABLE = "NON_NEGOTIABLE"  # Forces the smartest routing tier
    HIGH           = "HIGH"            # Forces at least a hybrid tier
    MEDIUM         = "MEDIUM"          # Prompt‑level guidance only
    LOW            = "LOW"             # Prompt‑level guidance only

class HierarchyEntry(BaseModel):
    goal: str
    label: PriorityLabel
    description: Optional[str] = None

class ValueHierarchy(BaseModel):
    name: Optional[str] = None
    entries: List[HierarchyEntry]
    conflict_rule: Optional[str] = None

Structuring the data this way lets us inject these rules into the AI's behavior at two key stages.
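As a quick illustration of what the structure buys you, here is a minimal sketch that turns a hierarchy into an explicit ordering. It uses stdlib dataclasses in place of the pydantic models so it runs without dependencies, and the `RANK` table and `ranked_goals` helper are hypothetical names introduced for this example, not part of the system described here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class PriorityLabel(str, Enum):
    NON_NEGOTIABLE = "NON_NEGOTIABLE"
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"


# Assumption: rank follows the declaration order of the enum (0 = most important).
RANK = {label: position for position, label in enumerate(PriorityLabel)}


@dataclass
class HierarchyEntry:
    goal: str
    label: PriorityLabel
    description: Optional[str] = None


@dataclass
class ValueHierarchy:
    entries: List[HierarchyEntry]
    name: Optional[str] = None
    conflict_rule: Optional[str] = None


def ranked_goals(hierarchy: ValueHierarchy) -> List[str]:
    """Return goal names ordered from highest to lowest priority."""
    ordered = sorted(hierarchy.entries, key=lambda entry: RANK[entry.label])
    return [entry.goal for entry in ordered]


hierarchy = ValueHierarchy(
    name="Medical Safety Stack",
    entries=[
        HierarchyEntry("conciseness", PriorityLabel.MEDIUM),
        HierarchyEntry("safety", PriorityLabel.NON_NEGOTIABLE),
        HierarchyEntry("clarity", PriorityLabel.HIGH),
    ],
)
print(ranked_goals(hierarchy))  # ['safety', 'clarity', 'conciseness']
```

Unlike the flat string `"clarity, safety"`, the order here comes from the labels rather than from wherever a goal happened to appear in the list.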

Level 1: Changing the AI's “Brain” (Prompt Injection)

...existing system prompt...

INTENT ENGINEERING DIRECTIVES (user‑defined — enforce strictly):
When optimization goals conflict, resolve in this order:
  1. [NON_NEGOTIABLE] safety: Always prioritise safety
  2. [HIGH] clarity
  3. [MEDIUM] conciseness

Conflict resolution: Safety first, always.

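Technical note: we use entry.label.value because the way string-subclassing enums are formatted changed in Python 3.11+, where formatting such a member in an f-string can produce "PriorityLabel.NON_NEGOTIABLE" instead of the bare value. Reading .value guarantees the prompt receives exactly the string "NON_NEGOTIABLE".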
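The rendering step itself might look something like the sketch below. This is an assumption about how such a block could be assembled, not the actual implementation: `render_directives` and the `Entry` tuple are hypothetical names, and the header wording is simplified.

```python
from enum import Enum
from typing import List, NamedTuple, Optional


class PriorityLabel(str, Enum):
    NON_NEGOTIABLE = "NON_NEGOTIABLE"
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"


class Entry(NamedTuple):
    goal: str
    label: PriorityLabel
    description: Optional[str] = None


def render_directives(entries: List[Entry], conflict_rule: Optional[str] = None) -> str:
    """Build a directive block to append to an existing system prompt."""
    lines = [
        "INTENT ENGINEERING DIRECTIVES (user-defined, enforce strictly):",
        "When optimization goals conflict, resolve in this order:",
    ]
    for position, entry in enumerate(entries, start=1):
        # .value is the bare string no matter how the enum formats itself.
        line = f"  {position}. [{entry.label.value}] {entry.goal}"
        if entry.description:
            line += f": {entry.description}"
        lines.append(line)
    if conflict_rule:
        lines += ["", f"Conflict resolution: {conflict_rule}"]
    return "\n".join(lines)


block = render_directives(
    [
        Entry("safety", PriorityLabel.NON_NEGOTIABLE, "Always prioritise safety"),
        Entry("clarity", PriorityLabel.HIGH),
        Entry("conciseness", PriorityLabel.MEDIUM),
    ],
    conflict_rule="Safety first, always.",
)
print(block)
```

The entries are rendered in the order given, so the caller decides the resolution order; the label is only printed next to each goal.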

Level 2: The “Bouncer” (Routing Tier)

We built a Router Tier Floor. When you tag a goal as NON_NEGOTIABLE, the system places a numeric floor under the routing score, so the request cannot be routed to a lower-tier model.

# Calculate the base score for the prompt 
score = await self._calculate_routing_score(prompt, context, ...)

# The Floor: Only fires when a hierarchy is active:
if value_hierarchy and value_hierarchy.entries:
    has_non_negotiable = any(
        e.label == PriorityLabel.NON_NEGOTIABLE for e in value_hierarchy.entries
    )
    has_high = any(
        e.label == PriorityLabel.HIGH for e in value_hierarchy.entries
    )

    # Force the request to a smarter model tier based on priority
    if has_non_negotiable:
        score["final_score"] = max(score.get("final_score", 0.0), 0.72)  # Guaranteed LLM
    elif has_high:
        score["final_score"] = max(score.get("final_score", 0.0), 0.45)  # Guaranteed Hybrid
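To make the floor behavior easy to check in isolation, here is a standalone sketch of the same rule. The thresholds 0.72 and 0.45 come from the snippet above; the function name `apply_tier_floor` and the idea of passing plain label strings are assumptions made for this example.

```python
from typing import Dict, Iterable

LLM_FLOOR = 0.72     # threshold used above for NON_NEGOTIABLE
HYBRID_FLOOR = 0.45  # threshold used above for HIGH


def apply_tier_floor(score: Dict[str, float], labels: Iterable[str]) -> Dict[str, float]:
    """Return a copy of score whose final_score is raised to the floor if needed."""
    present = set(labels)
    floored = dict(score)
    current = floored.get("final_score", 0.0)
    if "NON_NEGOTIABLE" in present:
        floored["final_score"] = max(current, LLM_FLOOR)
    elif "HIGH" in present:
        floored["final_score"] = max(current, HYBRID_FLOOR)
    return floored


# A cheap-looking prompt is lifted to the LLM tier by a NON_NEGOTIABLE goal.
print(apply_tier_floor({"final_score": 0.2}, ["NON_NEGOTIABLE", "HIGH"]))
# A prompt that already scores high is left alone: the floor never lowers a score.
print(apply_tier_floor({"final_score": 0.9}, ["HIGH"]))
```

Because the rule uses max(), it can only raise a score. MEDIUM and LOW goals leave routing untouched and act through the prompt alone.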

import hashlib
import json

def _hierarchy_fingerprint(value_hierarchy) -> str:
    """Short hash of the active hierarchy, for use as part of a cache key."""
    if not value_hierarchy or not value_hierarchy.entries:
        return ""   # empty string → same cache key as usual
    return hashlib.md5(
        json.dumps(
            [{"goal": e.goal, "label": str(e.label)} for e in value_hierarchy.entries],
            sort_keys=True
        ).encode()
    ).hexdigest()[:8]
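How the fingerprint feeds into caching is not shown here, so the following is only a sketch under assumptions: `cache_key` is a hypothetical function, and entries are passed as plain (goal, label) string pairs. It illustrates why an empty hierarchy leaves existing cache keys unchanged while any active hierarchy gets its own key.

```python
import hashlib
import json
from typing import List, Tuple


def hierarchy_fingerprint(entries: List[Tuple[str, str]]) -> str:
    """Eight hex characters identifying a hierarchy, or "" when there is none."""
    if not entries:
        return ""
    payload = json.dumps(
        [{"goal": goal, "label": label} for goal, label in entries],
        sort_keys=True,
    )
    return hashlib.md5(payload.encode()).hexdigest()[:8]


def cache_key(prompt: str, entries: List[Tuple[str, str]]) -> str:
    """Hypothetical cache key: prompt hash plus the hierarchy fingerprint."""
    prompt_hash = hashlib.md5(prompt.encode()).hexdigest()
    return prompt_hash + hierarchy_fingerprint(entries)


prompt = "Summarise the dosage instructions."
strict = [("safety", "NON_NEGOTIABLE"), ("clarity", "HIGH")]
relaxed = [("safety", "HIGH"), ("clarity", "HIGH")]

print(cache_key(prompt, []))       # same key as before hierarchies existed
print(cache_key(prompt, strict))   # differs from the plain key
print(cache_key(prompt, relaxed))  # differs again, so results are not shared
```

Note that sort_keys=True only orders the keys inside each entry; the list order is preserved, so reordering the entries produces a different fingerprint.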

In Practice (MCP Integration)

{
  "tool": "define_value_hierarchy",
  "arguments": {
    "name": "Medical Safety Stack",
    "entries": [
      { "goal": "safety", "label": "NON_NEGOTIABLE", "description": "Always prioritise patient safety" },
      { "goal": "clarity", "label": "HIGH" },
      { "goal": "conciseness", "label": "MEDIUM" }
    ],
    "conflict_rule": "Safety first, always."
  }
}
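If you want to sanity-check a payload like this before sending it, a small parser is enough. The sketch below is not the tool's own validation; `parse_entries` is a hypothetical helper that simply rejects any label outside the four defined in PriorityLabel.

```python
import json
from enum import Enum
from typing import List, Tuple


class PriorityLabel(str, Enum):
    NON_NEGOTIABLE = "NON_NEGOTIABLE"
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"


def parse_entries(arguments: dict) -> List[Tuple[str, PriorityLabel]]:
    """Return (goal, label) pairs; raises ValueError on an unknown label."""
    return [(item["goal"], PriorityLabel(item["label"])) for item in arguments["entries"]]


payload = json.loads("""
{
  "tool": "define_value_hierarchy",
  "arguments": {
    "name": "Medical Safety Stack",
    "entries": [
      { "goal": "safety", "label": "NON_NEGOTIABLE" },
      { "goal": "clarity", "label": "HIGH" },
      { "goal": "conciseness", "label": "MEDIUM" }
    ],
    "conflict_rule": "Safety first, always."
  }
}
""")

entries = parse_entries(payload["arguments"])
print([(goal, label.value) for goal, label in entries])

try:
    parse_entries({"entries": [{"goal": "speed", "label": "URGENT"}]})
except ValueError as error:
    print("rejected:", error)
```

Catching a misspelled label early matters here because an unrecognized label would otherwise silently skip the routing floor.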

Summary

If you want to experiment with this approach, install Prompt Optimizer:

npm install -g mcp-prompt-optimizer

Feel free to share how you handle conflicting constraints in your own pipeline!
