Zero-Stall AI: Building a Self-Managing TDD Pipeline with Autonomous Agents

Published: (June 8, 2026 at 11:13 AM EDT)
8 min read
Source: Dev.to


tl;dr: Point an AI agent at a GitHub Issue, have it write a failing E2E test, implement the fix, commit with full provenance metadata, deploy to staging, and ping you on Telegram. One tap to approve. Ship.

Most teams using AI coding assistants hit the same wall:

  • Agents stall waiting for a human to respond in the IDE
  • Context windows expire mid-task, losing all progress
  • No audit trail: you don’t know which model wrote what, how long it took, or how many tokens it cost
  • Tests are an afterthought: AI writes code first, tests sometimes never
  • Staging review requires a laptop, killing async workflows

This post describes a systematic architecture that solves all five.

The foundation is simple: a GitHub Issue is the single source of truth for every unit of work. Each issue contains:

  • A reference to the relevant PRD section
  • Acceptance criteria written as plain-language assertions
  • The last iteration snapshot (what was done, what failed, what’s next)
  • Links to test artefacts (video, trace, HTML report)
  • Token/time metadata from every AI session

When an AI agent starts a task, it reads the issue. When it ends, whether it finished or ran out of tokens, it writes back to the issue. The next agent (or the same one in a new session) picks up exactly where things left off.

Why Issues and not a file? Issues survive branch switches, are visible to all team members, support comments and labels, and integrate natively with CI/CD triggers.
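The read-on-start, write-on-exit contract described above can be pictured as a small data model. What follows is only a sketch in Python (the post does not say what the agents are written in), using an in-memory stand-in rather than the GitHub API. The names `WorkItem`, `Snapshot`, `start_session` and `end_session`, and the numbers in the usage lines, are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Snapshot:
    """Last-iteration state an agent leaves behind (hypothetical shape)."""
    done: List[str]
    failed: List[str]
    next_step: str
    tokens_used: int
    duration_min: int


@dataclass
class WorkItem:
    """In-memory stand-in for one GitHub Issue; not a real API client."""
    number: int
    prd_ref: str
    acceptance_criteria: List[str]
    artefact_links: List[str] = field(default_factory=list)
    history: List[Snapshot] = field(default_factory=list)

    def start_session(self) -> Optional[Snapshot]:
        """An agent reads the issue first: return the latest snapshot, if any."""
        return self.history[-1] if self.history else None

    def end_session(self, snapshot: Snapshot) -> None:
        """An agent writes back on exit, whether it finished or not."""
        self.history.append(snapshot)

    def total_tokens(self) -> int:
        """Token metadata accumulates across every AI session."""
        return sum(s.tokens_used for s in self.history)


issue = WorkItem(
    number=32,
    prd_ref="PRD section reference goes here",  # placeholder, not a real reference
    acceptance_criteria=["Table sort order matches canvas view"],
)
first_read = issue.start_session()  # None: the first agent starts from the criteria alone

# Illustrative numbers only.
issue.end_session(Snapshot(
    done=["Wrote failing spec"], failed=[], next_step="Implement sort fix",
    tokens_used=4000, duration_min=8,
))
resumed = issue.start_session()  # a later agent, possibly in another environment
print(resumed.next_step)
```

Keeping the history as an append-only list mirrors how issue comments behave: nothing is overwritten, so the trail of sessions stays auditable.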
┌──────────────────────────────────────────┐
│  📋 GITHUB ISSUE                         │
│  PRD ref · Acceptance Criteria · State   │
└──────────────────┬───────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────┐
│  🔴 RED PHASE                            │
│  Write Playwright spec from criteria     │
│  Test MUST fail before code is written   │
└──────────────────┬───────────────────────┘
                   │ FAIL confirmed ✓
                   ▼
┌──────────────────────────────────────────┐
│  🛠️ GREEN PHASE                          │
│  AI implements minimal fix               │ ◄──── loops here on CHANGE
│  Guardian reviews: types · no duplication│
└──────────────────┬───────────────────────┘
                   │ re-run test
                   ▼
┌──────────────────────────────────────────┐
│  ✅ PASS                                 │
│  Video · Trace · HTML report saved       │
│  Artefacts posted to Issue comment       │
└──────────────────┬───────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────┐
│  📦 GITOPS COMMIT                        │
│  branch: tdd/issue-slug                  │
│  platform · model · tokens · duration    │
│  rollback tag created                    │
└──────────────────┬───────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────┐
│  🚀 STAGING DEPLOY                       │
│  Auto-deploy on tdd/* push               │
│  Preview URL → Issue + Telegram          │
└──────────────────┬───────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────┐
│  👤 HUMAN REVIEW (on your phone)         │
│  Artefacts + checklist via Telegram      │
│                                          │
│  APPROVE ──► Merge to main               │
│  CHANGE  ──► Back to GREEN PHASE         │
│  ESCALATE──► human-blocked · agent exits │
└──────────────────────────────────────────┘
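One way to read the diagram is as a small state machine. The sketch below is an assumption about how an orchestrator might encode it, not the post's implementation: `Phase` and `advance` are hypothetical names, and the guardian review inside the GREEN phase is left out to keep it short.

```python
from enum import Enum


class Phase(Enum):
    RED = "red"
    GREEN = "green"
    PASS = "pass"
    COMMIT = "commit"
    DEPLOY = "deploy"
    REVIEW = "review"
    MERGED = "merged"
    HUMAN_BLOCKED = "human-blocked"


# Straight-line steps between a passing test and the human review.
_NEXT = {
    Phase.PASS: Phase.COMMIT,
    Phase.COMMIT: Phase.DEPLOY,
    Phase.DEPLOY: Phase.REVIEW,
}

# One-word replies accepted at the review step.
_DECISIONS = {
    "APPROVE": Phase.MERGED,
    "CHANGE": Phase.GREEN,
    "ESCALATE": Phase.HUMAN_BLOCKED,
}


def advance(phase: Phase, *, test_passed: bool = False, decision: str = "") -> Phase:
    """Return the phase that follows `phase` in the loop."""
    if phase is Phase.RED:
        if test_passed:
            # A new test that passes immediately means RED was never confirmed.
            raise RuntimeError("RED not confirmed: test passed before any code was written")
        return Phase.GREEN
    if phase is Phase.GREEN:
        # Re-run after the fix; stay in GREEN until the test passes.
        return Phase.PASS if test_passed else Phase.GREEN
    if phase is Phase.REVIEW:
        try:
            return _DECISIONS[decision.strip().upper()]
        except KeyError:
            raise ValueError(f"unknown decision: {decision!r}") from None
    if phase in _NEXT:
        return _NEXT[phase]
    raise ValueError(f"{phase.name} is terminal")


phase = advance(Phase.RED, test_passed=False)  # failure confirmed, move to GREEN
phase = advance(phase, test_passed=True)       # fix works, move to PASS
while phase is not Phase.REVIEW:
    phase = advance(phase)                     # COMMIT, DEPLOY, REVIEW
print(advance(phase, decision="change"))       # Phase.GREEN
```

Raising on a test that passes during RED, instead of quietly continuing, matches the "stop and investigate" stance the post takes on that case.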

The one invariant rule: A test must fail before any code is written. This forces acceptance criteria to be precise, ensures the test exercises the right behaviour, and gives a clear signal when the implementation is complete.

An issue-knowledge-manager agent reads the current issue, extracts the PRD reference, parses acceptance criteria, and loads the last iteration snapshot. This costs ~500 tokens and takes under 10 seconds. Every subsequent agent in the loop starts from this shared context.

A qa-test-engineer agent writes an E2E spec from the acceptance criteria. Before handing off, it runs the test suite and confirms the new test fails. A test that passes immediately means the criterion was already satisfied, or the test is wrong. Either way, stop and investigate.

A frontend-dev or backend-dev agent implements the minimal code change. A guardian agent then reviews the diff: no new any types, no duplicate logic, patterns consistent with the codebase. Only after approval does the loop return to Phase 2 (the qa-test-engineer's test run) for a re-run.

Once the test passes, a release-automation agent commits with structured metadata:

    [vscode/claude-sonnet-4] fix: table sort order matches canvas view

    Issue: #32
    Platform: VSCode Extension
    Model: claude-sonnet-4
    Tokens used: ~11,200
    Duration: 22 min
    Tests: 14/14 passing
    Staging: https://your-app-pr-42.vercel.app
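A commit message in this shape is easy to generate from structured data, which keeps the fields consistent across agents. The following is a minimal sketch, assuming a hypothetical `CommitMeta` record and `format_commit` helper; the field names simply follow the example above.

```python
from dataclasses import dataclass


@dataclass
class CommitMeta:
    platform_slug: str  # short tag used in the subject prefix, e.g. "vscode"
    platform: str
    model: str
    issue: int
    tokens_used: int
    duration_min: int
    tests_passed: int
    tests_total: int
    staging_url: str


def format_commit(kind: str, summary: str, meta: CommitMeta) -> str:
    """Build a commit message in the layout shown above."""
    subject = f"[{meta.platform_slug}/{meta.model}] {kind}: {summary}"
    body = [
        f"Issue: #{meta.issue}",
        f"Platform: {meta.platform}",
        f"Model: {meta.model}",
        f"Tokens used: ~{meta.tokens_used:,}",
        f"Duration: {meta.duration_min} min",
        f"Tests: {meta.tests_passed}/{meta.tests_total} passing",
        f"Staging: {meta.staging_url}",
    ]
    # Blank line between subject and body, as git expects.
    return subject + "\n\n" + "\n".join(body)


message = format_commit(
    "fix",
    "table sort order matches canvas view",
    CommitMeta("vscode", "VSCode Extension", "claude-sonnet-4", 32,
               11200, 22, 14, 14, "https://your-app-pr-42.vercel.app"),
)
print(message)
```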

A rollback tag is created before the commit: test-pass/32/2026-03-31. Any other AI environment can roll back to this exact state with one command.

A devops-engineer agent ensures every tdd/* branch triggers an automatic staging deploy. The preview URL is posted to the GitHub Issue and sent via the messaging gateway.

The orchestrator sends a notification containing:

  • A link to the Playwright video recording
  • The HTML test report
  • The staging preview URL
  • A checklist of acceptance criteria with pass/fail status

You reply with one word: APPROVE, CHANGE, or ESCALATE. The agent handles the rest.

The biggest practical failure mode for AI agents is getting stuck. Here is the full safety net:

Trigger                        →  Agent Response
─────────────────────────────────────────────────────────────────────
Token count > 80k (soft)       →  Save snapshot to Issue · continue
Token count > 95k (hard)       →  Save snapshot · Telegram alert · EXIT
No tool response for 30 min    →  Save snapshot · Telegram “stalled on #N” · EXIT
Same action repeated 3×        →  Break loop · log to Issue · Telegram · EXIT
API key exhausted              →  Rotate to fallback key · log rotation · continue
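The trigger table maps naturally onto a single check the orchestrator could run between steps. This is a rough sketch under stated assumptions: `AgentStatus` and `check` are hypothetical names, the action strings are placeholders for real side effects, and the rule that exit conditions take precedence over "continue" conditions is my own choice, since the table does not define an order.

```python
from dataclasses import dataclass
from typing import List, Tuple

SOFT_TOKEN_LIMIT = 80_000
HARD_TOKEN_LIMIT = 95_000
STALL_LIMIT_MIN = 30
REPEAT_LIMIT = 3


@dataclass
class AgentStatus:
    tokens_used: int
    minutes_since_tool_response: float
    recent_actions: List[str]
    api_key_exhausted: bool = False


def check(status: AgentStatus) -> Tuple[List[str], bool]:
    """Return (actions to take, whether the agent must exit)."""
    last = status.recent_actions[-REPEAT_LIMIT:]
    looping = len(last) == REPEAT_LIMIT and len(set(last)) == 1

    # Exit conditions first (assumed precedence), so a hard stop always wins.
    if status.tokens_used > HARD_TOKEN_LIMIT:
        return ["save_snapshot", "telegram_alert"], True
    if status.minutes_since_tool_response >= STALL_LIMIT_MIN:
        return ["save_snapshot", "telegram_stalled"], True
    if looping:
        return ["break_loop", "log_to_issue", "telegram_alert"], True

    # Conditions the agent can recover from and keep going.
    actions: List[str] = []
    if status.api_key_exhausted:
        actions += ["rotate_to_fallback_key", "log_rotation"]
    if status.tokens_used > SOFT_TOKEN_LIMIT:
        actions.append("save_snapshot")
    return actions, False


soft = check(AgentStatus(82_000, 2, ["edit", "run_tests"]))
loop = check(AgentStatus(60_000, 1, ["run_tests"] * 3))
print(soft)  # (['save_snapshot'], False)
print(loop)  # (['break_loop', 'log_to_issue', 'telegram_alert'], True)
```

Returning a list of action names instead of performing them keeps the policy testable without a live Telegram bot or issue tracker.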

The exit contract: An agent that exits cleanly always writes a snapshot to the issue first (what was completed, what was in progress, the exact file and line being worked on). Any agent that reads this snapshot can continue from that exact point, in any environment.

   📋 GITHUB ISSUE (Source of Truth)
                  │
                  ▼
┌─────────────────────────────────────┐  ORCHESTRATION LAYER
│ tdd-orchestrator                    │  Drives the loop · enforces token budget
│ issue-knowledge-manager             │  Reads + writes issue state
└─────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────┐  IMPLEMENTATION LAYER
│ qa-test-engineer                    │  Playwright specs · artefact collection
│ frontend-dev / backend-dev          │  Minimal fix · strict types · no any
│ guardian                            │  Code review gate · no duplication
└─────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────┐  SHIPPING LAYER
│ release-automation                  │  Commit · metadata · rollback tag · PR
│ devops-engineer                     │  Staging deploy · preview URL
└─────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────┐  HUMAN LOOP
│ Messaging Gateway (Telegram/Slack)  │  Artefacts + checklist to your phone
│ You                                 │  APPROVE · CHANGE · ESCALATE
└─────────────────────────────────────┘
                  │
                  └──────── decision flows back to orchestrator

Each layer is independently replaceable. Swap Playwright for Cypress. Swap Telegram for Slack. The orchestration contract stays the same.

Every commit in this system is self-describing. Someone (or another AI) reading the git log six months from now can reconstruct exactly:

Field                     What it tells you
─────────────────────────────────────────────────────────────────────
Commit message            What changed
Issue reference           Why it changed (links to acceptance criteria)
Tests: 14/14              How it was validated
Model: claude-sonnet-4    What wrote it
Tokens: ~11,200           What it cost
Staging URL               Where to see it live
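Because the fields follow a fixed "Key: value" layout, they can be recovered from a commit message mechanically. The parser below is a sketch, not part of the post's tooling: `parse_commit` is a hypothetical helper, and the subject-line pattern is inferred from the single example commit shown earlier.

```python
import re
from typing import Dict

# Subject shape inferred from the example: "[platform/model] type: summary".
_SUBJECT = re.compile(
    r"^\[(?P<platform>[^/\]]+)/(?P<model>[^\]]+)\]\s+(?P<kind>\w+):\s+(?P<summary>.+)$"
)


def parse_commit(message: str) -> Dict[str, str]:
    """Split a commit message in the post's format into named fields."""
    lines = message.strip().splitlines()
    match = _SUBJECT.match(lines[0])
    if match is None:
        raise ValueError("subject does not follow '[platform/model] type: summary'")
    fields = {"kind": match["kind"], "summary": match["summary"]}
    for line in lines[1:]:
        key, sep, value = line.partition(": ")
        if sep:  # skip blank lines and anything that is not "Key: value"
            fields[key.strip()] = value.strip()
    return fields


sample = """[vscode/claude-sonnet-4] fix: table sort order matches canvas view

Issue: #32
Platform: VSCode Extension
Model: claude-sonnet-4
Tokens used: ~11,200
Duration: 22 min
Tests: 14/14 passing
Staging: https://your-app-pr-42.vercel.app
"""
parsed = parse_commit(sample)
print(parsed["Model"], parsed["Tokens used"], parsed["Staging"])
```

In practice the input could come from `git log -1 --format=%B`, which prints the raw message of the latest commit.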

This is not overhead. It is the foundation of trustworthy AI-assisted development.

Every time an agent writes back to an issue, it uses this structured template:

Iteration Snapshot — 2026-03-31 14:22

Status: PASS
Agent: qa-test-engineer + frontend-dev
Platform: VSCode Extension | Model: claude-sonnet-4
Tokens: ~13,400 | Duration: 24 min

Completed this iteration

  • Wrote Playwright spec for acceptance criterion 2 (table sort order)
  • Confirmed RED: test failed on expect(rows[0]).toBe('SKU-001')
  • Implemented sort fix in TableView.tsx:214
  • Confirmed GREEN: 14/14 tests passing
  • Committed: abc1234 · Tagged: test-pass/32/2026-03-31

Artefacts

Next step

Acceptance criterion 3 — clicking a table row should select the node on canvas. Start at: TableView.tsx + useCanvasSelection hook.
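The template only helps if it is written every time, including when a session ends badly. One possible way to enforce the exit contract is to post the snapshot from a `finally` block. The sketch below assumes a plain list as a stand-in for the issue's comment thread; `SessionState`, `render`, `agent_session` and the "EXITED EARLY" status text are hypothetical, and only some of the template's sections are rendered.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class SessionState:
    agent: str
    status: str = "IN PROGRESS"
    completed: List[str] = field(default_factory=list)
    next_step: str = ""


def render(state: SessionState) -> str:
    """Render the sections of the template that this sketch tracks."""
    bullets = [f"  • {item}" for item in state.completed]
    return "\n".join([
        f"Status: {state.status}",
        f"Agent: {state.agent}",
        "",
        "Completed this iteration",
        "",
        *bullets,
        "",
        "Next step",
        "",
        state.next_step,
    ])


@contextmanager
def agent_session(issue_comments: List[str], agent: str) -> Iterator[SessionState]:
    """Guarantee a snapshot is written however the session ends."""
    state = SessionState(agent=agent)
    try:
        yield state
    except Exception as exc:
        state.status = f"EXITED EARLY: {exc}"
        raise
    finally:
        # Runs on success and on failure alike.
        issue_comments.append(render(state))


comments: List[str] = []  # stand-in for the issue's comment thread
try:
    with agent_session(comments, "frontend-dev") as session:
        session.completed.append("Confirmed RED on acceptance criterion 2")
        session.next_step = "Implement sort fix in TableView.tsx"
        raise TimeoutError("no tool response for 30 min")
except TimeoutError:
    pass

print(comments[0])
```

The simulated timeout still leaves one comment behind, with the completed work and the next step intact, which is what lets a later agent resume.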

Before                                      After
─────────────────────────────────────────────────────────────────────────────────────────────
Agent stalls waiting for IDE response       Times out, saves state, exits, pings you
Context window resets kill progress         Issue snapshot = resumable from any environment
No idea what the AI changed or why          Every commit is a fully documented time capsule
Tests written after the fact (if at all)    Tests define done: no test, no merge
Staging review requires a laptop            One-tap approve from your phone
Token exhaustion = lost work                Snapshot at 80k, graceful exit at 95k
AI writes the same pattern twice            Guardian agent blocks duplication before commit

[ ] Define acceptance criteria in GitHub Issues (not just task descriptions)
[ ] Set up E2E testing (Playwright) with video: 'on' and trace: 'on'
[ ] Configure branch-based staging deploys (Vercel, Netlify, or equivalent)
[ ] Set up a messaging gateway for human-in-the-loop notifications (Telegram bot is easiest)
[ ] Write agent definition files for each role (orchestrator, qa, dev, release, devops)
[ ] Establish the commit metadata convention and enforce it from day one
[ ] Set token budget thresholds: 80k soft, 95k hard is a solid baseline
[ ] Create an issue snapshot template so all agents write consistent state

The goal is not to remove humans from software development. It is to remove humans from the parts that do not require human judgment (running tests, writing boilerplate, deploying previews, rotating API keys) and to surface the parts that do, cleanly, on the device you actually have in your hand.

A GitHub Issue with a clear acceptance criterion, an E2E test that fails first, a commit that documents its own provenance, and a one-tap decision from your phone: that is a workflow a team can trust, audit, and scale.

The AI does not need to be perfect. It needs to be accountable.

Tags: #aiagents #tdd #devops #llmops #playwright #vercel #gitops
