What the Question Parser Extracts from the User's String: Keywords, Scope, Shape, Decomposition, Clarification

Published: 4 hours ago (June 17, 2026, 9:00 PM GMT+9)

13 min read

Source: Towards Data Science

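Part of the question parsing brick of Enterprise Document Intelligence, a series that builds an enterprise RAG system from four bricks (parsing, question parsing, retrieval, generation). Article 6_a (the thesis) makes the case for parsing the question and shows how the parsed row splits into two briefs. This article walks through the five things the parser extracts from the user's string: keywords, the expected answer shape and type, scope hints, a decomposition for compound questions, and a clarification field for inputs too vague to act on. Article 6_c (dispatch) covers what the parser then decides on top of those fields, using the document's profile.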

Where this article sits in the series: Article 6 (question parsing), the extraction part, inside Part II (the four bricks). Image by author.

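The user types a one-line string: “What is the maximum coverage amount? Don’t confuse it with the deductible, they’re often listed together.” The parser turns this text into a row with the following columns: the topic, the expected answer shape (an amount), a scope hint (the contract), a negative signal (not the deductible) that is routed to the generation brief, and a layout hint that retrieval can use (listed together). Each piece becomes its own column on question_df. This article walks through the five field families one at a time, with the code that fills each column and the typed schema that stores it.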

Question parsing produces one row on question_df plus satellite tables, and two views of that output feed retrieval and generation. Image by author.

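1. The five field families the parser fills

A question carries more than words. It also says what shape the answer should take, where in the document to look, and whether the question is compound or too vague to act on. The parser captures all of this and writes it as columns on question_df. Read this article as a menu, not a checklist.

The columns fall into two groups.

What the parser reads off the question itself

Keywords: the tokens retrieval will search with. They are combined from several sources: explicit (named by the user), direct (pulled from the question), LLM rewrites, an expert concept dictionary, and high-signal regex anchors (e.g. L131-1).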

Here is the minimal schema we’ll grow as the section goes:

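from pydantic import BaseModel

class ParsedQuestion(BaseModel):
    original_question: str  # the user's string, stored untouched
    keywords: list[str]     # tokens retrieval will search with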

For “What is the maximum coverage amount?”, the parser produces:

ParsedQuestion(
    original_question="What is the maximum coverage amount?",
    keywords=["maximum", "coverage", "amount"],
)
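The article does not show how the direct tokens are pulled out, so the following is only a minimal sketch under stated assumptions: a tiny hand-written stopword list and a simple regex tokenizer. The names `STOPWORDS` and `extract_direct_keywords` are hypothetical.

```python
import re

# Illustrative stopword list (an assumption; a real project would use a fuller one).
STOPWORDS = {"what", "is", "the", "a", "an", "of", "to", "how", "does", "do", "in", "it", "with"}

def extract_direct_keywords(question: str) -> list[str]:
    """Lowercase, tokenize on letters, drop stopwords, keep first-seen order."""
    # A token is a run of letters, optionally joined by inner hyphens (self-attention).
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", question.lower())
    seen: set[str] = set()
    keywords: list[str] = []
    for token in tokens:
        if token in STOPWORDS or token in seen:
            continue
        seen.add(token)
        keywords.append(token)
    return keywords

print(extract_direct_keywords("What is the maximum coverage amount?"))
# ['maximum', 'coverage', 'amount']
```

Nothing in this step looks at spelling: whatever the user typed goes straight into the keyword list.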

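Keywords inherit whatever typos and grammar slips the question carries. Pull tokens straight from “How does multi- head atention compare to self-atention?” and retrieval searches for atention, a string the document never contains. Zero hits, the system returns nothing, and the user concludes the topic isn’t covered.

The fix is a cheap pre-step that runs before keyword extraction: one LLM call that corrects typos and grammar without changing meaning, so the keywords come out clean.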

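from openai import OpenAI
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def correct_spelling(question: str) -> str:
    """Fix typos and grammar without changing meaning."""
    # The prompt wording is illustrative; the original snippet does not show it.
    prompt = f"Fix typos and grammar without changing the meaning. Return only the corrected text.\n\n{question}"
    resp = client.responses.create(model="gpt-4.1-mini", input=prompt)
    return resp.output_text.strip()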

In production, this is cached (the same question typed by multiple users gets corrected once) and skipped when the input is clean.
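The caching and the skip-when-clean behaviour could be wired up roughly as follows. This is a sketch, not the article's implementation: `KNOWN_WORDS`, `looks_clean`, and `make_cached_corrector` are hypothetical names, the cleanliness check is a toy vocabulary lookup, and `fake_llm_correct` stands in for the real LLM call so the example runs offline.

```python
from functools import lru_cache
from typing import Callable

# Hypothetical vocabulary of known-good words; a real system might use a spellchecker.
KNOWN_WORDS = {"how", "does", "attention", "compare", "to", "self", "multi", "head"}

def looks_clean(question: str) -> bool:
    """Cheap check: every alphabetic token is in the known vocabulary."""
    tokens = question.lower().replace("-", " ").split()
    words = ["".join(ch for ch in t if ch.isalpha()) for t in tokens]
    return all(w in KNOWN_WORDS for w in words if w)

def make_cached_corrector(correct: Callable[[str], str], maxsize: int = 10_000) -> Callable[[str], str]:
    """Wrap a corrector so clean input skips it and repeated questions hit the cache."""
    cached = lru_cache(maxsize=maxsize)(correct)

    def corrector(question: str) -> str:
        normalized = " ".join(question.split())  # collapse whitespace for better cache hits
        if looks_clean(normalized):
            return normalized
        return cached(normalized)

    return corrector

calls: list[str] = []

def fake_llm_correct(question: str) -> str:
    """Stand-in for the LLM call; records each invocation."""
    calls.append(question)
    return question.replace("atention", "attention")

corrector = make_cached_corrector(fake_llm_correct)
first = corrector("How does multi-head atention compare to self-atention?")
second = corrector("How does multi-head atention compare to self-atention?")
clean = corrector("How does multi-head attention compare to self-attention?")
print(first)
print(len(calls))  # 1: the repeat hit the cache, the clean input skipped the call
```

An in-process `lru_cache` only helps within one worker; sharing corrections across users on several machines would need an external cache, which is outside this sketch.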

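Let users name the keywords themselves.

Some users (analysts, paralegals, anyone fluent in the document’s vocabulary) already know exactly which terms they want matched. A UI hint, “List exact terms to search for, separated by commas”, opens the highest-precision retrieval path the system has. The user’s tokens go in verbatim: weight 1.0, source direct, no LLM, no synonym expansion (unless they opt in). For “Please find ‘force majeure’, ‘rescission’, ‘event of default’ in this contract”, the parser pulls the three quoted phrases as-is.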

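When the user can name the terms, this path is faster, cheaper, and more accurate than any LLM rewrite. The product side matters too: a “search terms” field next to the question box, or a system prompt instruction (“include the exact terms you want matched”), moves a measurable share of queries onto this path.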
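The verbatim path described above can be sketched with a regex over quoted phrases plus a parser for the comma-separated UI field. The `Keyword` record and both function names are assumptions made for illustration; only the weight of 1.0 and the `direct` source label come from the text.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Keyword:
    text: str
    weight: float
    source: str

OPEN = "\"'“‘"   # straight and curly opening quotes
CLOSE = "\"'”’"  # straight and curly closing quotes
# The lookarounds keep apostrophes inside words (don't) from opening or closing a phrase.
QUOTED = re.compile(rf"(?<!\w)[{OPEN}]([^{OPEN}{CLOSE}]+)[{CLOSE}](?!\w)")

def extract_user_terms(question: str) -> list[Keyword]:
    """Pull quoted phrases verbatim: weight 1.0, source 'direct', no expansion."""
    return [Keyword(m.strip(), 1.0, "direct") for m in QUOTED.findall(question) if m.strip()]

def parse_terms_field(field_value: str) -> list[Keyword]:
    """Parse the comma-separated 'search terms' UI field the same way."""
    return [Keyword(t.strip(), 1.0, "direct") for t in field_value.split(",") if t.strip()]

terms = extract_user_terms(
    "Please find ‘force majeure’, ‘rescission’, ‘event of default’ in this contract"
)
print([k.text for k in terms])
# ['force majeure', 'rescission', 'event of default']
```

A phrase that itself contains a quote character (for example a possessive) would be cut short by this pattern, so the dedicated field is the more robust of the two entry points.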

When the user doesn’t name terms explicitly, three parser-side sources fill in: LLM rewrites, an expert concept dictionary, and anchor regex.

Vocabulary mismatch is the first thing to break.

The user asks about “the cap on what the insurer will pay” and the document says “limit of indemnity per occurrence.” The gap shows up everywhere in enterprise:

Insurance: ‘the cap on what the insurer will pay’ → ‘limit of indemnity per occurrence’.
Legal: ‘what happens if we exit early’ → ‘early termination provisions’ or ‘rights of rescission’.
Finance: ‘how much we’ll get paid back’ → ‘principal repayment schedule’ or ‘redemption terms’.
Medical: ‘side effects’ → ‘adverse events’ or ‘contraindications’.

When the keyword column holds the wrong tokens, the search misses everything. Three sources combine to fill it with terms the document actually uses.
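How the three sources might be combined is sketched below. Everything here is an assumption for illustration: the `CONCEPTS` dictionary entries are taken from the examples above but its structure is invented, the `ANCHOR` pattern is one guess at what a reference like L131-1 looks like, and `llm_rewrite` is an optional callable supplied by the caller rather than a real API.

```python
import re
from typing import Callable, Optional

# Hypothetical expert dictionary: user phrasing -> terms the documents actually use.
CONCEPTS: dict[str, list[str]] = {
    "cap on what the insurer will pay": ["limit of indemnity per occurrence"],
    "exit early": ["early termination provisions", "rights of rescission"],
    "side effects": ["adverse events", "contraindications"],
}

# Assumed anchor pattern for article-style references such as "L131-1".
ANCHOR = re.compile(r"\b[A-Z]\d{2,4}-\d+\b")

def expand_keywords(
    question: str,
    llm_rewrite: Optional[Callable[[str], list[str]]] = None,
) -> list[tuple[str, str]]:
    """Return (term, source) pairs from the dictionary, anchors, and optional LLM rewrites."""
    lowered = question.lower()
    found: list[tuple[str, str]] = []
    for phrase, terms in CONCEPTS.items():
        if phrase in lowered:
            found += [(t, "dictionary") for t in terms]
    found += [(a, "anchor") for a in ANCHOR.findall(question)]
    if llm_rewrite is not None:
        found += [(t, "llm") for t in llm_rewrite(question)]
    # De-duplicate on the term, keeping the first source that produced it.
    seen: set[str] = set()
    unique: list[tuple[str, str]] = []
    for term, source in found:
        if term.lower() not in seen:
            seen.add(term.lower())
            unique.append((term, source))
    return unique

print(expand_keywords("What is the cap on what the insurer will pay under article L131-1?"))
# [('limit of indemnity per occurrence', 'dictionary'), ('L131-1', 'anchor')]
```

Keeping the source next to each term lets later stages weight a dictionary hit differently from an LLM guess.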


1.2 Dispatch

What the parser then decides (using the document’s profile on top of the above).

Dispatch: How much surrounding context to read and return, which chunk strategy to use, which model to call. All cascaded from the answer type, the matched concept, and the project’s defaults.

The dispatcher determines how many surrounding tokens to retrieve, which chunking strategy (e.g., sliding window, hierarchical) to apply, and which language model to invoke. This decision is driven by the answer type (single value vs. listing), the matched concept identifier, and any project‑specific defaults such as maximum context length.
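A cascade of this kind could look like the sketch below. The precedence order (project defaults, then answer type, then matched concept), the table contents, the token counts, and all names are assumptions; the article only states that the three inputs cascade into the decision.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class Dispatch:
    context_tokens: int
    chunk_strategy: str
    model: str

# All values below are illustrative assumptions, not the article's settings.
PROJECT_DEFAULT = Dispatch(context_tokens=800, chunk_strategy="sliding_window", model="gpt-4.1-mini")
MAX_CONTEXT_TOKENS = 4000

BY_ANSWER_TYPE: dict[str, dict] = {
    "single_value": {"context_tokens": 400},
    "listing": {"context_tokens": 3000, "chunk_strategy": "hierarchical"},
}
BY_CONCEPT: dict[str, dict] = {
    "limit_of_indemnity": {"context_tokens": 1200},
}

def dispatch(answer_type: str, concept_id: Optional[str]) -> Dispatch:
    """Cascade: project defaults, then answer type, then matched concept."""
    overrides = {
        **BY_ANSWER_TYPE.get(answer_type, {}),
        **BY_CONCEPT.get(concept_id or "", {}),  # later entries win
    }
    decided = replace(PROJECT_DEFAULT, **overrides)
    # Project-wide cap on how much context is ever requested.
    return replace(decided, context_tokens=min(decided.context_tokens, MAX_CONTEXT_TOKENS))

print(dispatch("listing", None))
print(dispatch("single_value", "limit_of_indemnity"))
```

Expressing each layer as a partial override keeps the tables small: a layer only names the settings it wants to change.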

1.3 Activations

Activations: Which bricks to run (TOC navigation, embeddings, cross‑references, …), downgraded by what the document supports.
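The downgrade step can be pictured as a lookup against the document's profile. The brick names follow the list above, but the profile flags (`has_toc` and so on), the `REQUIRES` table, and the `activate` function are hypothetical.

```python
# Hypothetical mapping from a brick to the document-profile flag it depends on.
REQUIRES: dict[str, str] = {
    "toc_navigation": "has_toc",
    "embeddings": "has_embeddings",
    "cross_references": "has_cross_references",
}

def activate(requested: list[str], profile: dict[str, bool]) -> dict[str, bool]:
    """Keep a requested brick only if the document profile supports it."""
    decisions: dict[str, bool] = {}
    for brick in requested:
        flag = REQUIRES.get(brick)
        # Bricks with no declared requirement stay on; others follow the profile.
        decisions[brick] = True if flag is None else profile.get(flag, False)
    return decisions

# A scanned document with embeddings but no table of contents.
profile = {"has_toc": False, "has_embeddings": True}
print(activate(["toc_navigation", "embeddings", "cross_references"], profile))
# {'toc_navigation': False, 'embeddings': True, 'cross_references': False}
```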


Each category becomes one or more columns on question_df. Projects pick what they need, skip the rest, and add new columns as failure modes show up: a policy_number for an insurance broker, a patient_id for medical RAG, a regulation_year for legal. The sub‑sections walk each one.
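Adding a project-specific column can be as small as subclassing the schema. The sketch below uses standard-library dataclasses to stay dependency-free (subclassing the article's pydantic `BaseModel` follows the same pattern); `InsuranceQuestion`, the `POLICY_RE` pattern, and the `POL-` number format are invented for illustration.

```python
import re
from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class ParsedQuestion:
    original_question: str
    keywords: list[str] = field(default_factory=list)

@dataclass
class InsuranceQuestion(ParsedQuestion):
    # Project-specific column; None when the question names no policy.
    policy_number: Optional[str] = None

# Assumed policy-number format, e.g. "POL-123456".
POLICY_RE = re.compile(r"\bPOL-\d{6}\b")

def parse_insurance_question(question: str) -> InsuranceQuestion:
    """Fill the extra column with a regex; the other fields are left for the main parser."""
    match = POLICY_RE.search(question)
    return InsuranceQuestion(
        original_question=question,
        policy_number=match.group(0) if match else None,
    )

row = asdict(parse_insurance_question("What is the maximum coverage amount on POL-482913?"))
print(row)
```

Each key of `row` corresponds to one column on question_df, so a new failure mode translates into one new field and one small extractor.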

1.4 Keywords

Retrieval needs words to search the document with. The parser picks them out of the question and hands them over. The user’s wording almost never matches the document’s wording on the first try, so the parser collects from several sources at once.

