Testing in the Age of AI Agents: How I Kept QA from Collapsing

Published: January 12, 2026 at 02:11 AM EST
5 min read
Source: Dev.to

AI agents changed my development tempo overnight. I can ship more code in a day than I used to in a week, and that sounds great until the first time a tiny edge case takes down an entire flow.

At that speed, QA becomes either a competitive advantage or a constant fire drill. I chose the first option, and I rebuilt the testing approach in my Orchestrator project around a small set of test‑design techniques that scale with code volume:

  • TDD
  • EP‑BVA (Equivalence Partitioning + Boundary Value Analysis)
  • Pairwise (Combinatorial Testing)
  • State Transition Testing

Testing Toolbox Diagram

Why I Needed “Test Design,” Not Just “More Tests”

When code volume grows, the problem is not only coverage. The real problem is that the space of possible inputs and states grows faster than my time.

So I stopped asking:

  • “Did I write tests for this function?”

and started asking:

  • “Did I select test cases that actually represent the failure surface?”

That mindset pushed me toward structured test‑design techniques.

TDD: Design for Testability from Day One

The Principle: TDD (Test‑Driven Development) flips the traditional “write code, then test” workflow. It follows the Red‑Green‑Refactor cycle:

  • Red: Write a test for a new requirement and watch it fail. This confirms the test actually checks something and that the requirement isn’t already met.
  • Green: Write the minimal amount of code to make the test pass. Avoid over‑engineering at this stage.
  • Refactor: Clean up the code while ensuring the tests stay green.

In Orchestrator:
Since AI agents can generate complex business logic rapidly, I used TDD to ensure that the logic was testable by design. For example, when implementing the RetryPolicy for our Temporal workflows, I started with the test cases for exponential backoff before writing a single line of the policy logic.

# Simplified TDD Example for Retry Logic
def test_retry_interval_calculation():
    policy = ExponentialRetry(base_delay=1.0, max_delay=10.0)
    # 1st attempt: 1.0 s
    assert policy.get_delay(attempt=1) == 1.0
    # 2nd attempt: 2.0 s
    assert policy.get_delay(attempt=2) == 2.0
    # Capped at 10.0 s
    assert policy.get_delay(attempt=10) == 10.0

This forced me to separate the calculation of delays from the execution of the retry, making the system modular and robust.
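To close the Red‑Green loop, here is one minimal policy that would make the test above pass in the Green phase. This is a sketch of the idea, not Orchestrator's actual RetryPolicy:

```python
class ExponentialRetry:
    """Minimal exponential-backoff policy driven out by the test above."""

    def __init__(self, base_delay: float, max_delay: float):
        self.base_delay = base_delay
        self.max_delay = max_delay

    def get_delay(self, attempt: int) -> float:
        # The delay doubles with each attempt and is capped at max_delay.
        return min(self.base_delay * 2 ** (attempt - 1), self.max_delay)
```

Because the test only talks to `get_delay`, the delay calculation stays independent of whatever code actually performs the retry.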

EP‑BVA: Efficiency through Mathematical Selection

The Principle

  • Equivalence Partitioning (EP): Divide the input domain into groups (partitions) where the system is expected to behave identically. Test one representative value from each group.
  • Boundary Value Analysis (BVA): Bugs often hide at the “edges” of these partitions. Test the exact boundaries and values just inside and outside them.

In Orchestrator:
When handling user‑uploaded files, we have strict size limits (e.g., 1 MB to 10 MB).

  • Invalid: below 1 MB
  • Valid: 1 MB to 10 MB (inclusive)
  • Invalid: above 10 MB

BVA Points: 0.99 MB, 1.0 MB, 1.01 MB, 9.99 MB, 10.0 MB, 10.01 MB.
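Those boundary points translate directly into a table‑driven check. Here `is_valid_size` is a hypothetical validator standing in for the real upload check, assuming the 1 MB and 10 MB limits are inclusive:

```python
MIN_MB, MAX_MB = 1.0, 10.0

def is_valid_size(size_mb: float) -> bool:
    # Hypothetical validator: both limits are treated as inclusive.
    return MIN_MB <= size_mb <= MAX_MB

# The six BVA points and their expected outcomes.
cases = [
    (0.99, False),   # just below the lower boundary
    (1.0, True),     # exactly on the lower boundary
    (1.01, True),    # just inside the lower boundary
    (9.99, True),    # just inside the upper boundary
    (10.0, True),    # exactly on the upper boundary
    (10.01, False),  # just above the upper boundary
]
for size, expected in cases:
    assert is_valid_size(size) == expected
```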

A critical real‑world example I applied was the 72‑byte limit of bcrypt. Many developers don’t realize that bcrypt silently ignores any bytes after the 72nd.

# apps/backend/tests/test_auth_service.py
def test_password_length_boundaries(self, auth_service):
    # Boundary: 72 bytes
    p72 = "a" * 72
    h72 = auth_service.get_password_hash(p72)

    # Just above the boundary: 73 bytes
    p73 = p72 + "b"
    # Bcrypt will treat p73 the same as p72 if only the first 72 bytes are used
    assert auth_service.verify_password(p73, h72) is True

By focusing on these specific points, I reduced hundreds of potential test cases to just 6‑10 highly effective ones.

Pairwise: Taming the Combinatorial Explosion

The Principle: Most bugs are caused by either a single input parameter or the interaction between two parameters. Pairwise Testing is a combinatorial method that ensures every possible pair of input parameters is tested at least once. This drastically reduces the number of test cases while maintaining high defect detection.

In Orchestrator:
Our AI inference engine has multiple configuration axes:

  • Execution Provider: CUDA, CPU, OpenVINO
  • Model Size: Small, Medium, Large
  • Quantization: INT8, FP16, FP32
  • Async Mode: Enabled, Disabled

Total combinations: 3 × 3 × 3 × 2 = 54 cases.

Using Pairwise, we can cover all interactions between any two settings in roughly 12‑15 cases.

# Using allpairspy to generate the matrix
from allpairspy import AllPairs

parameters = [
    ["CUDA", "CPU", "OpenVINO"],
    ["Small", "Medium", "Large"],
    ["INT8", "FP16", "FP32"],
    ["Enabled", "Disabled"]
]

for i, combo in enumerate(AllPairs(parameters), start=1):
    print(f"Test Case {i}: {combo}")

This allows us to maintain high confidence in our hardware‑compatibility matrix without running the full 54‑case suite on every PR.
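To see why pairwise stays small without pulling in a library, here is a naive greedy generator in pure Python. It is a sketch for intuition only; allpairspy uses a smarter algorithm and typically produces a tighter suite:

```python
from itertools import combinations, product

def greedy_pairwise(parameters):
    """Greedily pick cases until every value pair across every two axes is covered."""
    # Every (axis index, value, axis index, value) pair that must appear.
    uncovered = {
        (i, a, j, b)
        for i, j in combinations(range(len(parameters)), 2)
        for a in parameters[i] for b in parameters[j]
    }
    suite = []
    for case in product(*parameters):
        covered = {
            (i, case[i], j, case[j])
            for i, j in combinations(range(len(case)), 2)
        }
        if covered & uncovered:  # keep only cases that cover a new pair
            suite.append(case)
            uncovered -= covered
        if not uncovered:
            break
    return suite

parameters = [
    ["CUDA", "CPU", "OpenVINO"],
    ["Small", "Medium", "Large"],
    ["INT8", "FP16", "FP32"],
    ["Enabled", "Disabled"],
]
suite = greedy_pairwise(parameters)
print(f"{len(suite)} cases instead of 54")
```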

State Transition Testing: Mapping the Life of a Process

The Principle: This technique is used when the system’s behavior depends on its current state and the events that occur. We map out a State Transition Diagram and ensure that:

  • All valid transitions are possible.
  • All invalid transitions are rejected or handled gracefully.

In Orchestrator:
Consider a simplified order‑processing workflow with the states Created → Approved → Shipped → Delivered. Events such as approve, ship, deliver, and cancel trigger transitions. By enumerating every state/event pair, we generate a concise test matrix that validates both happy‑path flows and error handling (e.g., attempting to ship an order that is still Created).

# Example state‑transition test matrix
import pytest

states = ["Created", "Approved", "Shipped", "Delivered", "Cancelled"]
events = {
    "approve": {"Created": "Approved"},
    "ship":    {"Approved": "Shipped"},
    "deliver": {"Shipped": "Delivered"},
    "cancel": {"Created": "Cancelled", "Approved": "Cancelled"}
}

def test_state_transitions():
    for event, mapping in events.items():
        for src, dst in mapping.items():
            assert transition(src, event) == dst
        # Verify invalid transitions raise an error
        invalid_src = set(states) - set(mapping.keys())
        for src in invalid_src:
            with pytest.raises(InvalidTransition):
                transition(src, event)

By systematically covering the state‑space, we catch bugs that only appear after a specific sequence of actions—something that pure unit‑test coverage often misses.
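The test above assumes a `transition` helper and an `InvalidTransition` exception; a minimal sketch of both, with a transition table mirroring the `events` mapping, could look like this:

```python
class InvalidTransition(Exception):
    """Raised when an event is not allowed from the current state."""

# (current state, event) -> next state; anything absent is invalid.
TRANSITIONS = {
    ("Created", "approve"): "Approved",
    ("Approved", "ship"): "Shipped",
    ("Shipped", "deliver"): "Delivered",
    ("Created", "cancel"): "Cancelled",
    ("Approved", "cancel"): "Cancelled",
}

def transition(state: str, event: str) -> str:
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise InvalidTransition(
            f"{event!r} is not allowed from state {state!r}"
        ) from None
```

Keeping the whole state machine in one lookup table makes the valid transitions auditable at a glance, and everything not listed fails loudly by default.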

Negative Testing for State Transitions

For every invalid state/event pair, I assert two things:

  • The transition is rejected or handled gracefully.
  • The system ends in the correct final state.

In Orchestrator:

The KYC (Know Your Customer) verification workflow is a complex state machine. A user’s document moves through:

PENDING → UPLOADING → PROCESSING → VERIFIED or REJECTED

I implemented tests to ensure a REJECTED document cannot suddenly jump to VERIFIED without going through PROCESSING again.

# apps/backend/tests/test_integration_kyc_workflow.py
def test_invalid_state_transitions(workflow_engine):
    workflow_engine.set_state(ImageStatus.REJECTED)

    # This should be blocked by the business logic
    with pytest.raises(IllegalStateError):
        workflow_engine.transition_to(ImageStatus.VERIFIED)

This is crucial for AI agents that might try to “short‑circuit” logic. By strictly testing the state machine, we ensure the integrity of the entire business process.
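One way such a guard can look, using the names from the test above. The allowed-transition table here is my assumption based on the prose (including a REJECTED → PROCESSING re‑verification path), not Orchestrator's actual workflow engine:

```python
from enum import Enum

class ImageStatus(Enum):
    PENDING = "pending"
    UPLOADING = "uploading"
    PROCESSING = "processing"
    VERIFIED = "verified"
    REJECTED = "rejected"

class IllegalStateError(Exception):
    pass

# VERIFIED is reachable only from PROCESSING; REJECTED documents
# must go back through PROCESSING (assumed re-verification path).
ALLOWED = {
    ImageStatus.PENDING: {ImageStatus.UPLOADING},
    ImageStatus.UPLOADING: {ImageStatus.PROCESSING},
    ImageStatus.PROCESSING: {ImageStatus.VERIFIED, ImageStatus.REJECTED},
    ImageStatus.VERIFIED: set(),
    ImageStatus.REJECTED: {ImageStatus.PROCESSING},
}

class WorkflowEngine:
    def __init__(self):
        self.state = ImageStatus.PENDING

    def set_state(self, state: ImageStatus) -> None:
        self.state = state

    def transition_to(self, new_state: ImageStatus) -> None:
        if new_state not in ALLOWED[self.state]:
            raise IllegalStateError(f"{self.state} -> {new_state} is not allowed")
        self.state = new_state
```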

Conclusion

In the AI‑agent era, code is cheap. Trust is not.

What kept my QA from collapsing was not writing more tests, but adopting test‑design techniques that scale:

  • TDD for fast feedback and safer refactors
  • EP‑BVA to systematize edge cases
  • Pairwise to tame combinatorial growth
  • State Transition Testing to validate real workflows

These are the testing tools I expect to keep using as my code volume continues to accelerate.
