Return Facts, Not Interpretations: LLM Tools Should Be Dumber Than You Think
Source: Dev.to
Part 1: The Problem
1.1 The Helpful Tool That Made Everything Worse
When I first built resolve_container for Verdex, I wanted it to be helpful. The tool walks up the DOM tree from a target element and returns the ancestor chain. I added interpretation:
{
"type": "product-card", // Tool guesses semantic meaning
"role": "list-item", // Tool guesses structural purpose
"confidence": 0.85, // Tool evaluates its own guess
"recommendation": "Use this as your container scope"
}
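For context, those interpretation fields came from pattern-matching heuristics along these lines (a hypothetical sketch, not the actual Verdex implementation), which guess meaning from attribute strings alone:
// Hypothetical sketch of the interpretation layer that was later removed.
// It labels elements purely from attribute strings, with no access to the
// user's query, the page's domain, or the downstream task.
function guessSemanticType(el: Element): { type: string; confidence: number } {
  const testId = el.getAttribute("data-testid") ?? "";
  if (testId.includes("card")) {
    // "product-card", "user-profile-card", ... all collapse to one label
    return { type: "product-card", confidence: 0.85 };
  }
  return { type: "unknown", confidence: 0.3 };
}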
At first this seemed reasonable—the tool pre‑analyzed the structure and made recommendations so the LLM didn’t have to.
In production, a page with user‑profile cards was mislabeled as “product‑card” (confidence 0.85). The LLM trusted the tool’s interpretation, generated selectors scoped to the wrong pattern, and tests broke on edge cases.
The issue wasn’t that the interpretation was usually wrong—it was usually right. The issue was fundamental: interpretation is context‑dependent, and the tool lacked context.
1.2 The Core Problem: Interpretation Is Context‑Dependent
Consider a <div data-testid="product-card">. What does it mean?
| Task | Interpretation |
|---|---|
| Selector authoring | Stable container; scope with getByTestId("product-card") |
| Visual testing | Component boundary; screenshot the whole card |
| Web scraping | Data structure; extract product info from child elements |
| Accessibility auditing | Semantic grouping; check ARIA labels |
The same DOM element yields four completely different meanings. When my tool chose one interpretation (“this is a product card, use it for selector scoping”), it applied that decision to all tasks, even though it lacked the user’s query, domain knowledge, and task context. Only the LLM possessed that information.
1.3 The Insight: Capability vs. Interpretation
Tools provide capability: Access to structural facts that would otherwise be hidden or expensive to retrieve.
LLMs provide interpretation: Deciding what those facts mean for a specific query in a specific context.
The tool’s job is to traverse the DOM and return what it finds—tags, attributes, depth, relationships—not to guess semantic types or prescribe usage.
Before (interpretation mixed in)
{
"container": {
"semanticType": "product-card", // Tool is guessing
"stability": "high", // Tool is evaluating
"recommended": true // Tool is prescribing
}
}
After (pure facts)
{
"ancestors": [
{
"level": 1,
"tagName": "div",
"attributes": { "data-testid": "product-card" },
"childElements": 5
}
]
}
The second version seems less “helpful” for humans, but it is more useful for LLMs because it preserves optionality. The same raw facts can be interpreted differently depending on the user’s query:
- Selector authoring: “That data-testid at level 1 is a stable container; I’ll use it for scoping.”
- Debugging: “There are 12 elements with that testid, probably a component copied without updating IDs.”
- Refactoring: “This pattern appears in 47 test files; needs careful migration.”
The architecture works because the capability layer stays interpretation‑free.
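In practice, the facts-only version is also the simpler one to build. Roughly, the capability layer is just an ancestor walk (a minimal sketch assuming a browser DOM; the helper and field names are illustrative, not the actual Verdex code):
// Minimal sketch of a facts-only ancestor walk (browser DOM assumed).
// It records what is there and nothing else: no types, no scores, no advice.
interface AncestorFact {
  level: number;
  tagName: string;
  attributes: Record<string, string>;
  childElements: number;
}

function collectAncestors(target: Element, maxLevels = 10): AncestorFact[] {
  const facts: AncestorFact[] = [];
  let current = target.parentElement;
  for (let level = 1; current && level <= maxLevels; level++) {
    facts.push({
      level,
      tagName: current.tagName.toLowerCase(),
      attributes: Object.fromEntries(
        Array.from(current.attributes).map((a) => [a.name, a.value])
      ),
      childElements: current.childElementCount,
    });
    current = current.parentElement;
  }
  return facts;
}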
Part 2: Why This Matters
2.1 The Composition Problem
When tools return interpretations, they make decisions that are hard to reverse. The LLM must either accept the interpretation or fight against it—both costly.
Example: The tool outputs "type": "product-card" with "confidence": 0.85. The user asks, “Find all user profile cards on this page.” The LLM sees the tool’s interpretation and has two options:
- Trust it (wrong): Generate selectors for product cards.
- Fight it (awkward, token‑expensive, unreliable): Explain why the tool’s interpretation doesn’t match the query.
If the tool returns raw facts ("data-testid": "product-card"), the LLM can examine the actual page structure, realize the testid is misleading, and adapt accordingly.
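For example, the raw output for one of those profile cards might look like this (attribute values are illustrative), and nothing in it forces a “product” reading on the LLM:
{
  "level": 1,
  "tagName": "div",
  "attributes": {
    "data-testid": "product-card",
    "class": "user-profile"
  },
  "childElements": 4
}
The misleading testid is still reported, but only as one fact among several; the LLM can weigh it against the class name, the children, and the user’s query.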
Principle: Tools that return facts compose across different tasks; tools that return interpretations optimize for one task and break others.
2.2 The Human API Trap
Human‑facing APIs are deliberately high‑level and opinionated. A method like page.selectDropdown("Country", "United States") is beautiful for developers because it hides fiddly details.
LLMs, however, work better with low‑level primitives:
page.click('select[name="country"]')
page.click('option:has-text("United States")')
Low‑level actions let the LLM adapt patterns to novel components, custom frameworks, or non‑standard implementations. High‑level abstractions only work for the exact cases they were designed for; they constrain the LLM’s flexibility.
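For instance, the same two-step pattern carries over to a custom dropdown that is not a native <select>, something a selectDropdown helper built around native selects could not handle (the selectors here are illustrative):
// Illustrative: a custom dropdown built from a trigger button and ARIA options
page.click('[data-testid="country-trigger"]')
page.click('[role="option"]:has-text("United States")')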
This is why resolve_container should return ancestor chains with raw attributes rather than “here’s your recommended container.” The LLM can then decide how to use that information for any downstream task.