Why Selenium Tests Fail on AI Chatbots (And How to Fix Them)

Published: December 14, 2025, 06:14 AM GMT+9

Source: Dev.to

What You’ll Learn

The Problem: Why WebDriverWait fails on streaming responses
MutationObserver: Zero-polling stream detection in the browser
Semantic Assertions: ML-based validation of non-deterministic output
TTFT Monitoring: Measuring Time-To-First-Token for LLM performance

The Fundamental Incompatibility

Traditional Selenium WebDriver tests assume a static page whose content loads once and then stabilizes. AI chatbots break that assumption in two ways:

Streaming Responses – Tokens arrive one at a time over 2–5 seconds. WebDriverWait often fires on the first token and captures only partial text.
Non-Deterministic Output – The same question can produce different (but equivalent) answers, so exact string assertions fail.

User: "Hello"
AI Response (Streaming):
  t=0ms:    "H"
  t=50ms:   "Hello"
  t=100ms:  "Hello! How"
  t=200ms:  "Hello! How can I"
  t=500ms:  "Hello! How can I help you today?"  ← FINAL

Standard Selenium captures: "Hello! How can I"  ← PARTIAL (FAIL!)

The Usual Hacks (And Why They Fail)

Hack	Why It Fails
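`time.sleep(5)`	Arbitrary; too short and the test is flaky, too long and CI slows down
`text_to_be_present_in_element`	Fires on the first match, so it misses the rest of the response
Polling with length checks	Race condition; the length can plateau in the middle of a stream
Exact string assertions	Impossible with non-deterministic AI output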
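
The length-polling race can be reproduced without a browser by replaying the "Hello" timeline shown earlier. The following is an illustrative simulation only: `TIMELINE`, `text_at`, and `poll_until_length_stable` are hypothetical names, and the 100 ms poll interval is an assumption.

```python
from bisect import bisect_right

# (timestamp in ms, text visible in the DOM at that moment), from the timeline above
TIMELINE = [
    (0, "H"),
    (50, "Hello"),
    (100, "Hello! How"),
    (200, "Hello! How can I"),
    (500, "Hello! How can I help you today?"),
]


def text_at(timeline, t_ms):
    """Return the text a poller would read from the DOM at time t_ms."""
    times = [t for t, _ in timeline]
    idx = bisect_right(times, t_ms) - 1
    return timeline[idx][1] if idx >= 0 else ""


def poll_until_length_stable(timeline, interval_ms, max_ms):
    """Naive hack: stop as soon as two consecutive polls see the same length."""
    previous_length = None
    for t_ms in range(0, max_ms + 1, interval_ms):
        text = text_at(timeline, t_ms)
        if text and previous_length is not None and len(text) == previous_length:
            return text
        previous_length = len(text)
    raise TimeoutError("length never stabilized")


captured = poll_until_length_stable(TIMELINE, interval_ms=100, max_ms=1000)
print(repr(captured))
```

With a 100 ms interval, the polls at 200 ms and 300 ms both see the 16-character text, so the poller returns "Hello! How can I" while the stream is merely paused between tokens.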

The Real Cost

Teams end up spending roughly 30% of their testing time debugging flaky AI tests instead of improving actual coverage.

The Solution: Browser‑Native Stream Detection

The browser knows when streaming has finished. Watching DOM changes directly in JavaScript with the MutationObserver API removes the need for Python-side polling and arbitrary sleeps.

from selenium_chatbot_test import StreamWaiter
from selenium.webdriver.common.by import By

# Wait for the AI response to complete streaming
waiter = StreamWaiter(driver, (By.ID, "chat-response"))
response_text = waiter.wait_for_stable_text(
    silence_timeout=500,     # treat the response as complete after 500 ms with no changes
    overall_timeout=30000    # maximum total wait, in ms
)

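StreamWaiter injects a MutationObserver that resets a timer on every DOM mutation. It returns only once the timer has run for the full silence_timeout without interruption, so the returned text is the complete response as long as the stream never pauses for longer than silence_timeout.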

Semantic Assertions: Testing Meaning, Not Words

Once the full response has been captured, compare it by semantic similarity rather than by exact string match.

from selenium_chatbot_test import SemanticAssert

asserter = SemanticAssert()

expected = "Hello! How can I help you today?"
actual = "Hi there! What can I assist you with?"

asserter.assert_similar(
    expected,
    actual,
    threshold=0.7  # require a semantic similarity of at least 0.7 (70%)
)
# ✅ PASSES – conveys the same intent

The library uses the all-MiniLM-L6-v2 model from sentence-transformers to generate embeddings and computes their cosine similarity. The model is loaded only on first use and runs on CPU, so CI does not need a GPU.
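
The comparison step itself is small enough to sketch in plain Python. The version below is an assumption-laden illustration, not the library's code: `cosine_similarity` and this `assert_similar` are hypothetical helpers, and the `embed` parameter stands in for whatever turns text into a vector (with sentence-transformers that would typically be `SentenceTransformer("all-MiniLM-L6-v2").encode`).

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def assert_similar(
    expected: str,
    actual: str,
    embed: Callable[[str], Vector],
    threshold: float = 0.7,
) -> float:
    """Raise AssertionError when the embeddings are less similar than threshold."""
    score = cosine_similarity(embed(expected), embed(actual))
    if score < threshold:
        raise AssertionError(
            f"semantic similarity {score:.2f} is below threshold {threshold:.2f}"
        )
    return score
```

Passing the embedding function in as a parameter keeps the check testable with toy vectors and lets the real model be loaded lazily, once, outside the assertion.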

TTFT: The LLM Performance Metric You’re Not Tracking

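Time-To-First-Token (TTFT) is a key metric for user experience. If a chatbot takes 3 seconds to start responding, users perceive it as broken even when the total response time is acceptable. Most teams do not monitor this metric at all.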

from selenium_chatbot_test import LatencyMonitor
from selenium.webdriver.common.by import By

with LatencyMonitor(driver, (By.ID, "chat-response")) as monitor:
    send_button.click()
    # ... wait for response ...

print(f"TTFT: {monitor.metrics.ttft_ms} ms")      # e.g. 41.7 ms
print(f"Total: {monitor.metrics.total_ms} ms")    # e.g. 2434.8 ms
print(f"Tokens: {monitor.metrics.token_count}")   # e.g. 48 mutations
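
How these numbers relate to each other can be shown with a small calculation over mutation timestamps. This is a sketch under assumptions, not the library's internals: `LatencyMetrics` and `summarize_latency` are hypothetical names, and it assumes the send time and each DOM mutation time were recorded on the same millisecond clock.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LatencyMetrics:
    ttft_ms: float       # send -> first visible change
    total_ms: float      # send -> last visible change
    mutation_count: int  # number of DOM mutations observed


def summarize_latency(
    start_ms: float, mutation_times_ms: Sequence[float]
) -> LatencyMetrics:
    """Derive TTFT and total time from the send time and mutation timestamps."""
    if not mutation_times_ms:
        raise ValueError("no mutations recorded; the response never started")
    ordered = sorted(mutation_times_ms)
    if ordered[0] < start_ms:
        raise ValueError("mutation recorded before the message was sent")
    return LatencyMetrics(
        ttft_ms=ordered[0] - start_ms,
        total_ms=ordered[-1] - start_ms,
        mutation_count=len(ordered),
    )


metrics = summarize_latency(1000.0, [1050.0, 1100.0, 3400.0])
print(metrics)
```

Note that this counts DOM mutations, which only approximates the token count: a frontend may batch several tokens into one DOM update.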

Real Demo Results

TTFT: 41.7 ms
Total time: 2.4 s
Semantic accuracy: 71 %

Putting It All Together

A complete test example that would be impractical with traditional Selenium alone:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium_chatbot_test import StreamWaiter, SemanticAssert, LatencyMonitor

def test_chatbot_greeting():
    driver = webdriver.Chrome()
    try:
        driver.get("https://my-chatbot.com")

        # Type a message
        input_box = driver.find_element(By.ID, "chat-input")
        input_box.send_keys("Hello!")

        # Monitor latency while waiting for response
        with LatencyMonitor(driver, (By.ID, "response")) as monitor:
            driver.find_element(By.ID, "send-btn").click()

            # Wait for streaming to complete (no time.sleep!)
            waiter = StreamWaiter(driver, (By.ID, "response"))
            response = waiter.wait_for_stable_text(silence_timeout=500)

        # Assert semantic meaning, not exact words
        asserter = SemanticAssert()
        asserter.assert_similar(
            "Hello! How can I help you today?",
            response,
            threshold=0.7
        )

        # Verify performance SLA
        assert monitor.metrics.ttft_ms
    finally:
        # Always close the browser, even when an assertion fails
        driver.quit()

GitHub:
