Return Facts, Not Interpretations: LLM Tools Should Be Dumber Than You Think
Source: Dev.to
Part 1: The Problem
1.1 The Helpful Tool That Made Everything Worse
When I first built resolve_container for Verdex, I wanted it to be helpful. The tool walks up the DOM tree from a target element and returns the ancestor chain. I added interpretation:
{
"type": "product-card", // Tool guesses semantic meaning
"role": "list-item", // Tool guesses structural purpose
"confidence": 0.85, // Tool evaluates its own guess
"recommendation": "Use this as your container scope"
}
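For context, those interpretation fields came from pattern-matching heuristics along these lines (a hypothetical sketch, not the actual Verdex implementation), which guess meaning from attribute strings alone:
// Hypothetical sketch of the interpretation layer that was later removed.
// It labels elements purely from attribute strings, with no access to the
// user's query, the page's domain, or the downstream task.
function guessSemanticType(el: Element): { type: string; confidence: number } {
  const testId = el.getAttribute("data-testid") ?? "";
  if (testId.includes("card")) {
    // "product-card", "user-profile-card", ... all collapse to one label
    return { type: "product-card", confidence: 0.85 };
  }
  return { type: "unknown", confidence: 0.3 };
}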
At first this seemed reasonable—the tool pre‑analyzed the structure and made recommendations so the LLM didn’t have to.
In production, a page with user‑profile cards was mislabeled as “product‑card” (confidence 0.85). The LLM trusted the tool’s interpretation, generated selectors scoped to the wrong pattern, and tests broke on edge cases.
The issue wasn’t that the interpretation was usually wrong—it was usually right. The issue was fundamental: interpretation is context‑dependent, and the tool lacked context.
1.2 The Core Problem: Interpretation Is Context‑Dependent
Consider a <div data-testid="product-card">. What does it mean?
| Task | Interpretation |
|---|---|
| Selector authoring | Stable container; scope with getByTestId("product-card") |
| Visual testing | Component boundary; screenshot the whole card |
| Web scraping | Data structure; extract product info from child elements |
| Accessibility auditing | Semantic grouping; check ARIA labels |
The same DOM element yields four completely different meanings. When my tool chose one interpretation (“this is a product card, use it for selector scoping”), it applied that decision to all tasks, even though it lacked the user’s query, domain knowledge, and task context. Only the LLM possessed that information.
1.3 The Insight: Capability vs. Interpretation
Tools provide capability: Access to structural facts that would otherwise be hidden or expensive to retrieve.
LLMs provide interpretation: Deciding what those facts mean for a specific query in a specific context.
The tool’s job is to traverse the DOM and return what it finds—tags, attributes, depth, relationships—not to guess semantic types or prescribe usage.
Before (interpretation mixed in)
{
"container": {
"semanticType": "product-card", // Tool is guessing
"stability": "high", // Tool is evaluating
"recommended": true // Tool is prescribing
}
}
After (pure facts)
{
"ancestors": [
{
"level": 1,
"tagName": "div",
"attributes": { "data-testid": "product-card" },
"childElements": 5
}
]
}
The second version seems less “helpful” for humans, but it is more useful for LLMs because it preserves optionality. The same raw facts can be interpreted differently depending on the user’s query:
- Selector authoring: “That data-testid at level 1 is a stable container; I’ll use it for scoping.”
- Debugging: “There are 12 elements with that testid, probably a component copied without updating IDs.”
- Refactoring: “This pattern appears in 47 test files; needs careful migration.”
The architecture works because the capability layer stays interpretation‑free.
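In practice, the facts-only version is also the simpler one to build. Roughly, the capability layer is just an ancestor walk (a minimal sketch assuming a browser DOM; the helper and field names are illustrative, not the actual Verdex code):
// Minimal sketch of a facts-only ancestor walk (browser DOM assumed).
// It records what is there and nothing else: no types, no scores, no advice.
interface AncestorFact {
  level: number;
  tagName: string;
  attributes: Record<string, string>;
  childElements: number;
}

function collectAncestors(target: Element, maxLevels = 10): AncestorFact[] {
  const facts: AncestorFact[] = [];
  let current = target.parentElement;
  for (let level = 1; current && level <= maxLevels; level++) {
    facts.push({
      level,
      tagName: current.tagName.toLowerCase(),
      attributes: Object.fromEntries(
        Array.from(current.attributes).map((a) => [a.name, a.value])
      ),
      childElements: current.childElementCount,
    });
    current = current.parentElement;
  }
  return facts;
}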
Part 2: Why This Matters
2.1 The Composition Problem
When tools return interpretations, they make decisions that are hard to reverse. The LLM must either accept the interpretation or fight against it—both costly.
Example: The tool outputs "type": "product-card" with "confidence": 0.85. The user asks, “Find all user profile cards on this page.” The LLM sees the tool’s interpretation and has two options:
- Trust it (wrong): Generate selectors for product cards.
- Fight it (awkward, token‑expensive, unreliable): Explain why the tool’s interpretation doesn’t match the query.
If the tool returns raw facts ("data-testid": "product-card"), the LLM can examine the actual page structure, realize the testid is misleading, and adapt accordingly.
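For example, the raw output for one of those profile cards might look like this (attribute values are illustrative), and nothing in it forces a “product” reading on the LLM:
{
  "level": 1,
  "tagName": "div",
  "attributes": {
    "data-testid": "product-card",
    "class": "user-profile"
  },
  "childElements": 4
}
The misleading testid is still reported, but only as one fact among several; the LLM can weigh it against the class name, the children, and the user’s query.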
Principle: Tools that return facts compose across different tasks; tools that return interpretations optimize for one task and break others.
2.2 The Human API Trap
Human‑facing APIs are deliberately high‑level and opinionated. A method like page.selectDropdown("Country", "United States") is beautiful for developers because it hides fiddly details.
LLMs, however, work better with low‑level primitives:
page.click('select[name="country"]')
page.click('option:has-text("United States")')
Low‑level actions let the LLM adapt patterns to novel components, custom frameworks, or non‑standard implementations. High‑level abstractions only work for the exact cases they were designed for; they constrain the LLM’s flexibility.
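For instance, the same two-step pattern carries over to a custom dropdown that is not a native <select>, something a selectDropdown helper built around native selects could not handle (the selectors here are illustrative):
// Illustrative: a custom dropdown built from a trigger button and ARIA options
page.click('[data-testid="country-trigger"]')
page.click('[role="option"]:has-text("United States")')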
This is why resolve_container should return ancestor chains with raw attributes rather than “here’s your recommended container.” The LLM can then decide how to use that information for any downstream task.