Stop Testing Success. Kill the Database. 🧨

Published: (December 11, 2025 at 05:00 AM EST)
3 min read
Source: Dev.to

Introduction

Chaos Engineering for QA

Intro to Chaos Engineering for QA. Learn how to test resilience by injecting failures with Docker and Playwright.

We are obsessed with the “Happy Path”.

In traditional QA, we verify that the application works when everything is perfect:

  • The network is stable.
  • The database responds in 5 ms.
  • Third‑party APIs are online.

But in production, nothing is perfect. Pods crash, networks lag, and databases lock up.

When these things happen, a standard Selenium/Playwright test just says Failed. It doesn’t tell you how the application failed. Did it show a graceful error message? Or did it crash with a white screen and a raw stack trace?
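One way to make that distinction explicit is to classify what the page actually shows after a failure. The following is a minimal sketch, not part of the original test: `classify_failure` and `CRASH_MARKERS` are hypothetical names, and the patterns are assumptions you would tune for your own stack.

```python
import re

# Hypothetical markers of an unhandled crash leaking into the UI.
# These are assumptions; adjust them to the frameworks you actually run.
CRASH_MARKERS = (
    re.compile(r"Traceback \(most recent call last\)"),        # Python traceback
    re.compile(r"^\s+at .+\(.+:\d+(:\d+)?\)", re.MULTILINE),   # JS / Java stack frame
    re.compile(r"Internal Server Error", re.IGNORECASE),       # default 500 page
)


def classify_failure(body_text: str) -> str:
    """Return 'blank', 'crash', or 'handled' for the visible text of a page."""
    if not body_text.strip():
        # Nothing rendered at all: the "white screen" case.
        return "blank"
    if any(pattern.search(body_text) for pattern in CRASH_MARKERS):
        # Raw error output reached the user.
        return "crash"
    # Some content is visible and none of it looks like a raw error.
    return "handled"
```

In a Playwright test, the input could come from `page.locator("body").inner_text()`, so a failing run reports "blank" or "crash" instead of a bare Failed.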

This is where Chaos Engineering comes in.

From QA to Resilience Engineering

Chaos Engineering isn’t just for Site Reliability Engineers (SREs). As modern QAs, we need to stop asking “Does it work?” and start asking “What happens when it breaks?”

Today, I’ll show you how to write a Chaos Test using Python, Playwright, and the Docker SDK.

The Goal

We aren’t going to wait for the database to fail. We are going to kill it intentionally in the middle of a test and verify that our frontend handles it gracefully.

The Stack

  • Python – test logic
  • Playwright – UI interaction
  • Docker SDK – the chaos injector

The Code 🐍

import docker
from playwright.sync_api import Page, expect

def test_database_failure_resilience(page: Page):
    # 1. Setup: Connect to Docker
    client = docker.from_env()

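    # Target your specific database container.
    # Run this only against a disposable test environment, never a live database.
    try:
        db_container = client.containers.get("postgres-prod")
    except docker.errors.NotFound:
        # NotFound means the daemon answered but no container has this name.
        raise RuntimeError("Database container not found! Does a container with this name exist?")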

    # 2. Happy Path: Verify the app loads normally
    print("✅ Step 1: Loading Dashboard...")
    page.goto("http://localhost:3000/dashboard")
    expect(page.locator(".user-balance")).to_be_visible()

    # 🧨 CHAOS TIME: Kill the Database
    print("🔥 Step 2: Injecting Chaos (Stopping DB)...")
    db_container.stop()

    try:
        # 3. Resilience Assertion
        # The app should NOT show a white screen or crash.
        # It SHOULD show a friendly "Connection Lost" toast or retry button.
        print("👀 Step 3: Verifying graceful degradation...")

        # Trigger an action that requires the DB
        page.reload()

        # Assert UI handles the error
        expect(page.locator(".error-toast")).to_contain_text("Connection lost")
        expect(page.locator(".retry-button")).to_be_visible()
    finally:
        # 🩹 RECOVERY: Bring the Database back, even if an assertion above failed
        print("🩹 Step 4: Healing the infrastructure...")
        db_container.start()

    # Trigger a manual retry. The database may need a few seconds to accept
    # connections again, so the final assertion gets a longer timeout.
    page.locator(".retry-button").click()

    # 4. Self‑Healing Assertion
    # The app should recover without requiring a full page refresh
    expect(page.locator(".user-balance")).to_be_visible(timeout=30_000)
    print("✅ Test Passed: System is resilient.")
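The recovery step relies on the database accepting connections shortly after the container starts, which is not always the case. A small polling helper can make that wait explicit. This is a sketch under assumptions: `wait_until` is a hypothetical helper, not part of Playwright or the Docker SDK, and its defaults are arbitrary.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll check() until it returns True or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            # An error from the probe is treated as "not ready yet".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Assuming the container runs the official Postgres image, which ships `pg_isready`, the test could call `assert wait_until(lambda: db_container.exec_run("pg_isready").exit_code == 0)` right after `db_container.start()` and before clicking the retry button.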

Why This Matters

If you run this test and your application shows a 500 Server Error page, you have found a bug—not a functional bug, but an architectural bug.

By adding “Chaos Tests” to your regression suite, you verify on every run that your product doesn’t just work but also survives the failures you inject.
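To reuse the stop-and-restart pattern across a regression suite, it can be wrapped in a context manager. This is a minimal sketch: `outage` is a hypothetical name, and it assumes only that the object passed in has `stop()` and `start()` methods, as Docker SDK containers do.

```python
from contextlib import contextmanager


@contextmanager
def outage(container):
    """Stop `container` for the duration of the with-block, then always restart it."""
    container.stop()
    try:
        yield container
    finally:
        # Runs even when an assertion inside the with-block fails,
        # so one failing chaos test does not leave the environment broken.
        container.start()
```

A test would then read `with outage(db_container): ...` around the degradation assertions. Swapping `stop()`/`start()` for the SDK's `pause()`/`unpause()` would simulate a database that hangs instead of one that disappears.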

Want More Chaos?

I write The 5‑Minute QA—a daily newsletter for Senior QAs and SDETs. Every morning, I send one actionable tip on Chaos Engineering.

👉 Subscribe here to get the tips in your inbox
