Waterfall Pattern: A Tiered Strategy for Reliable Data Extraction

Published: 1 day ago (February 15, 2026, 05:05 AM GMT+9)

11 min read


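The Waterfall Method – Building Resilient Scrapers

It is 3 a.m. and your production scraper has just crashed. The logs reveal a familiar cause: a developer on the target website renamed a CSS class from product-price to price-v2-red. A seemingly trivial change that took five seconds broke your entire data pipeline.

If you rely solely on visual CSS selectors, you are building on shifting sand. Websites change constantly, and every redesign becomes a maintenance nightmare. To build a resilient scraper, use a "waterfall" approach: a tiered priority system that tries several extraction methods in order before giving up.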

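The Stability Hierarchy

A web page is not just a visual document; it is made up of several data layers, each with a different level of stability. The Waterfall Method ranks these layers from most stable to least stable.

Tier	Layer	Why It’s Stable
Tier 1	Hidden Data (JSON‑LD / Script Tags)	Structured data used for SEO or by internal JavaScript frameworks is designed for machines rather than people. It rarely changes when the UI is redesigned.
Tier 2	Semantic Anchors (IDs / Data Attributes)	Unique identifiers such as `id="product-123"` or `data-testid="price-display"` are usually tied to database keys or automated test suites. Developers rarely change them because doing so could break internal tooling.
Tier 3	Relational XPath	When no specific ID exists, look for the label. CSS classes change, but the word “Price:” usually stays. XPath can find that text and grab the element next to it.
Tier 4	Visual Selectors (CSS Classes)	This is the last resort. CSS classes such as `.blue-text` change whenever a designer wants a new look. Use them only when every other method has failed.

Starting at Tier 1 and working down the waterfall maximizes your chance of success while keeping maintenance to a minimum.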

Setting Up the Environment

We will use parsel, the selector library that powers Scrapy. It lets you use CSS, XPath, and regular expressions on a single object.

pip install parsel requests

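The Mock HTML Snippet

The following HTML fragment is used throughout this guide. It represents a typical e‑commerce page with several data layers:

# Element names and attributes below are assumed from the selectors
# used later in this guide (script type, "Price:" label span, data-testid).
html_content = """
<html>
    <head>
        <script type="application/ld+json">
        {
            "@context": "https://schema.org/",
            "@type": "Product",
            "name": "Ultimate Coffee Grinder",
            "sku": "GRND-99",
            "offers": {
                "price": "89.99",
                "priceCurrency": "USD"
            }
        }
        </script>
    </head>
    <body>
        <h1>Ultimate Coffee Grinder</h1>
        <div>
            <span>Price:</span>
            <span data-testid="product-price">$89.99</span>
        </div>
    </body>
</html>
"""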

Tier 1 – The Gold Standard (Hidden JSON)

Modern websites often embed structured data in <script> tags (usually JSON‑LD for SEO or a “window state” object for frameworks like Next.js). This source is highly stable because it is independent of the HTML layout.

import json
from parsel import Selector

def extract_tier_1(selector):
    # Locate the script tag containing JSON‑LD
    json_data = selector.css('script[type="application/ld+json"]::text').get()
    if json_data:
        data = json.loads(json_data)
        # Navigate the dictionary safely
        return data.get('offers', {}).get('price')
    return None

sel = Selector(text=html_content)
print(f"Tier 1 Result: {extract_tier_1(sel)}")
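
JSON‑LD found on real pages is not always a single object: a page may carry several blocks, a block may be a list or wrap its nodes in "@graph", and "offers" may itself be a list. The sketch below is one possible way to cope with those shapes. It assumes the raw text of each script tag has already been collected (for instance with the `.getall()` form of the selector above); the name `find_product_price` is hypothetical and not part of parsel.

```python
import json

def find_product_price(raw_blocks):
    """Return the first Product price found in a list of raw JSON-LD strings."""
    for raw in raw_blocks:
        try:
            data = json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            continue  # malformed or missing block: try the next one

        # Normalise the common shapes (object, list, @graph) into a list of nodes
        nodes = data.get("@graph", [data]) if isinstance(data, dict) else data
        if not isinstance(nodes, list):
            nodes = [nodes]

        for node in nodes:
            if not isinstance(node, dict):
                continue
            types = node.get("@type")
            if "Product" not in (types if isinstance(types, list) else [types]):
                continue
            offers = node.get("offers") or {}
            if isinstance(offers, list):  # several offers: take the first one
                offers = offers[0] if offers else {}
            price = offers.get("price") if isinstance(offers, dict) else None
            if price is not None:
                return str(price)
    return None

blocks = [
    "{not valid json",
    '{"@graph": [{"@type": "WebSite"}, {"@type": "Product", "offers": [{"price": 89.99}]}]}',
]
print(find_product_price(blocks))  # 89.99
```

Skipping a malformed block instead of raising keeps one bad script tag from hiding a valid one further down the page.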

Tier 2 – Semantic Anchors (IDs & Data Attributes)

If JSON‑LD isn’t available, look for Semantic Anchors. These attributes describe what the data is rather than how it looks. IDs and data‑* attributes are frequently used for state management or end‑to‑end testing and change far less often than styling classes.

def extract_tier_2(selector):
    # Try an ID first. If IDs are dynamic, use a “starts‑with” selector.
    price = selector.css('[id^="price-id-"]::text').get()

    # Fall back to data attributes often used in modern frameworks
    if not price:
        price = selector.css('[data-testid="product-price"]::text').get()

    return price.replace('$', '').strip() if price else None
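
The `replace('$', '')` step above only covers a plain dollar amount. If the displayed value can include thousands separators or other currency symbols, a small normalizer may help. This is a minimal sketch: `clean_price` is a hypothetical helper, and it assumes a comma is a thousands separator and a dot is the decimal mark, which does not hold for every locale.

```python
import re
from decimal import Decimal, InvalidOperation

def clean_price(raw):
    """Turn a display string such as '$1,299.00' into a Decimal, or None."""
    if not raw:
        return None
    # Keep digits, dots and commas; drop currency symbols, spaces and words
    digits = re.sub(r"[^\d.,]", "", raw)
    # Assumption: comma = thousands separator, dot = decimal mark
    digits = digits.replace(",", "")
    try:
        return Decimal(digits)
    except InvalidOperation:
        return None  # nothing numeric was left, e.g. "Call for price"

print(clean_price(" $89.99 "))   # 89.99
print(clean_price("$1,299.00"))  # 1299.00
```

Returning a Decimal rather than a float avoids rounding surprises when the prices are later summed or compared.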

Tier 3 – Text‑Based Relational Logic (XPath)

When clean IDs are missing, rely on the visible text labels. On an e‑commerce site, the word “Price:” is almost always present next to the actual value.

Using XPath axes, you can locate the element containing the label “Price:” and navigate to its neighbor. This label‑to‑value relationship usually persists even if the tag types change.

def extract_tier_3(selector):
    # Find a <span> containing "Price:", then get the next sibling <span>
    xpath_query = "//span[contains(text(), 'Price:')]/following-sibling::span/text()"
    price = selector.xpath(xpath_query).get()
    return price.replace('$', '').strip() if price else None

Tier 4 – The Last Resort (Regex)

Sometimes the DOM is a mess: obfuscated classes, no IDs, and deeply nested structures. In these cases, treat the HTML as a plain string and use regular expressions. Regex ignores the DOM tree entirely, allowing you to pull out values based on patterns.

import re

def extract_tier_4(html):
    # Require a literal "$", allow optional whitespace, then capture the amount.
    # (An optional "$" would match the first number anywhere, e.g. the "99" in a SKU.)
    match = re.search(r'\$\s*([0-9]+(?:\.[0-9]{2})?)', html)
    return match.group(1) if match else None
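
A pattern that looks only for an amount returns whichever figure appears first in the markup, which may be a shipping threshold or a crossed-out price. One way to narrow it, sketched below, is to anchor the regex on the visible label and allow any run of tags between the label and the amount. The pattern and the name `extract_labelled_price` are illustrative assumptions, not part of the original guide.

```python
import re

# Label, then any mix of tags and whitespace, then "$<amount>"
LABELLED_PRICE = re.compile(
    r"Price:\s*(?:<[^>]+>\s*)*\$\s*(\d+(?:\.\d{2})?)"
)

def extract_labelled_price(html):
    match = LABELLED_PRICE.search(html)
    return match.group(1) if match else None

raw = 'Free shipping over $25.00 <span>Price:</span> <span class="a1">$89.99</span>'
print(extract_labelled_price(raw))  # 89.99
```

The earlier "$25.00" is skipped because it is not preceded by the label, at the cost of depending on the label text staying the same.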

Putting It All Together

def waterfall_extract(html):
    selector = Selector(text=html)

    # Tier 1
    price = extract_tier_1(selector)
    if price:
        return price

    # Tier 2
    price = extract_tier_2(selector)
    if price:
        return price

    # Tier 3
    price = extract_tier_3(selector)
    if price:
        return price

    # Tier 4
    return extract_tier_4(html)

print("Final price:", waterfall_extract(html_content))

Running the script against the mock HTML gives:

Final price: 89.99
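
The repeated "try, check, return" steps can also be expressed as a loop over an ordered list of extractors, which makes it easy to add or reorder tiers. The following is a sketch under that assumption: `run_waterfall` and the stand-in lambdas are hypothetical, and in practice each entry would wrap one of the tier functions.

```python
def run_waterfall(source, extractors):
    """Try each (name, function) pair in order; return (value, name) of the first hit."""
    for name, extract in extractors:
        try:
            value = extract(source)
        except Exception:
            value = None  # a crashing tier counts as a miss, not a fatal error
        if value:
            return value, name
    return None, None

# Stand-in extractors for illustration only
tiers = [
    ("json_ld", lambda html: 1 / 0),      # simulates a tier that raises
    ("attributes", lambda html: None),    # simulates a tier that finds nothing
    ("xpath_label", lambda html: "89.99" if "Price:" in html else None),
    ("regex", lambda html: "0.00"),
]
print(run_waterfall("<span>Price:</span><span>$89.99</span>", tiers))
# ('89.99', 'xpath_label')
```

Returning the tier name alongside the value lets the caller see which layer actually produced the data.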

Recap

Start with hidden, machine‑readable data (JSON‑LD, API payloads).
Fall back to semantic anchors (id, data‑*).
Use relational XPath based on stable text labels.
Resort to regex only when the DOM offers no reliable hooks.

By following this Waterfall Method, your scrapers become far more resilient to redesigns, class renames, and other superficial changes, saving you countless late‑night debugging sessions. Happy scraping!

An Additional Regex Fallback (Tier 4 – Alternative)

If every other method fails, you can search for a price pattern hidden inside a JavaScript variable or a deeply nested string.

import re

def extract_tier_4(html_string):
    # Search for a pattern like price: "89.99" anywhere in the raw HTML
    match = re.search(r'price":\s*"([\d.]+)"', html_string)
    if match:
        return match.group(1)
    return None
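
The pattern above expects a quoted value after a quoted key. Inline JavaScript objects often use bare keys and numeric values instead, so a slightly looser pattern may be needed. This is an illustrative sketch: `find_js_price` and the `window.__STATE__` variable in the sample are made-up names.

```python
import re

# Optional quotes around the key and the value; \b keeps "old_price" from matching
PRICE_IN_JS = re.compile(r'\bprice"?\s*:\s*"?(\d+(?:\.\d+)?)')

def find_js_price(html_string):
    match = PRICE_IN_JS.search(html_string)
    return match.group(1) if match else None

inline_js = 'window.__STATE__ = {product: {price: 89.99, sku: "GRND-99"}};'
print(find_js_price(inline_js))           # 89.99
print(find_js_price('"price": "89.99"'))  # 89.99
```

A key such as priceCurrency is not matched, because the pattern requires the colon to follow "price" directly.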

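The Full Waterfall Function with Logging

Combine the tiered methods into a single function. It tries the most stable method first and logs a warning whenever it has to drop to a lower tier. This alerting tells you the site has changed before the scraper actually stops working.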

import logging
from parsel import Selector   # or any selector library you use

logging.basicConfig(level=logging.INFO)

def get_product_price(html):
    sel = Selector(text=html)

    # Tier 1: JSON‑LD
    price = extract_tier_1(sel)
    if price:
        return price
    logging.warning("Tier 1 failed. Falling back to Tier 2 (Attributes).")

    # Tier 2: Semantic Attributes
    price = extract_tier_2(sel)
    if price:
        return price
    logging.warning("Tier 2 failed. Falling back to Tier 3 (XPath Relational).")

    # Tier 3: XPath Relational
    price = extract_tier_3(sel)
    if price:
        return price
    logging.error("Tier 1‑3 failed. Attempting Tier 4 (Regex) as last resort.")

    # Tier 4: Regex on raw string
    return extract_tier_4(html)

final_price = get_product_price(html_content)
print(f"Final Extracted Price: {final_price}")
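
Individual warnings are easy to miss across thousands of pages. One option is to count which tier produced each value and warn only when Tier 1's share falls. The sketch below assumes the caller knows the winning tier number; `record_tier`, the 0.8 threshold, and the 10-page minimum are arbitrary illustrative choices.

```python
import logging
from collections import Counter

logger = logging.getLogger("waterfall")
tier_hits = Counter()

def record_tier(tier, alert_below=0.8, min_pages=10):
    """Count the winning tier and warn when Tier 1's share drops too low."""
    tier_hits[tier] += 1
    total = sum(tier_hits.values())
    share = tier_hits[1] / total
    if total >= min_pages and share < alert_below:
        logger.warning("Tier 1 share is %.0f%% over %d pages", share * 100, total)
    return share

# Simulated run: 7 pages served by Tier 1, then 3 that fell back to Tier 3
for tier in [1] * 7 + [3] * 3:
    share = record_tier(tier)
print(f"{share:.1f}")  # 0.7
```

A falling Tier 1 share is a signal that the primary selectors need attention even though every page still returned a price.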

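Why This Matters

Imagine the site owner ships an update: the JSON‑LD (Tier 1) is removed and the attributes that Tier 2 depends on are renamed along with every CSS class.

In a traditional scraper, the code returns None and crashes. With the Waterfall Method, the Tier 3 logic still finds the data. A warning appears in the logs, so you can update the primary selectors during working hours instead of handling a late-night emergency.

Wrapping Up

Prioritize machine-readable data: check for JSON‑LD or <script> tags first.
Use semantic anchors: prefer data-* attributes and id attributes over CSS classes.
Use XPath relationships: use human-readable labels as anchors to find adjacent data.
Monitor fallbacks: log whenever the scraper drops to a lower tier so you can respond to selector changes proactively.

By avoiding brittle class-based selectors, you spend less time fixing broken code and more time using the data. For more advanced examples, see the Homedepot.com Scrapers repository.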
