We Dogfooded Our Own 110-Page Production Playbook. Here's What We Learned.

Published: February 5, 2026 at 12:54 AM EST
10 min read
Source: Dev.to

Or: How we discovered that writing about best practices doesn’t mean you’re following them


The Setup: Building a Guide We Weren’t Following

Three weeks ago we shipped something we were genuinely proud of: the Production Deployment Playbook, a 110‑page comprehensive guide for taking AI agents from prototype to production.

  • Why we wrote it: Gartner predicts that 40% of GenAI projects will be canceled by 2027 because of the massive gap between building a demo and running a reliable service.
  • Our pain: We’d felt that gap ourselves, watched teams struggle, and decided to document everything we’d learned.

The playbook covers the full spectrum:

  1. Governance frameworks for AI decision‑making
  2. Security best practices for LLM applications
  3. Monitoring and observability strategies
  4. Infrastructure‑as‑code templates
  5. Testing methodologies
  6. Incident‑response procedures

We poured months of real‑world experience into those pages, interviewed teams that had made it to production, and documented the failure modes nobody talks about at conferences. It was comprehensive, practical, and good—until someone asked the obvious question:

“Do we actually follow this ourselves?”

The silence that followed was telling. We’d been so focused on documenting best practices for others that we hadn’t audited our own house. We’d become the proverbial cobbler whose children have no shoes—or, in our case, the AI infrastructure company whose own agent platform was held together with duct tape and hope.

So we did what any reasonably self‑aware team would do: we grabbed our own playbook, turned it on ourselves, and started scoring. What we found was humbling, instructive, and honestly kind of hilarious in that painful way that only true self‑recognition can be.


The Audit: A Brutally Honest Self‑Assessment

We approached this the way we recommend others do in Chapter 3 of the playbook: structured, systematic, and without mercy.

  1. Created a scoring rubric based on the playbook’s key areas.
  2. Assigned point values and started checking boxes.

Results

  • Agent kits (SDK, testing frameworks, monitoring libraries, deployment utilities): 8–9 / 10
  • forAgents.dev (the website where we publish all this wisdom): 2 / 10

The irony wasn’t lost on us. We’d created a comprehensive guide to production deployment while running a production service that violated most of its principles. It’s like publishing a book on minimalism from a cluttered apartment, or teaching time‑management while chronically late.

But irony is only useful if it teaches you something. The gap between our agent kits and forAgents.dev revealed a classic meta‑problem in software development:

We’re great at building tools for specific problems, but significantly worse at applying those tools to our own work.

It’s the difference between a mechanic who builds excellent tools and one who maintains his own truck.

Why the Gap?

  • “We’ll clean this up later” mindset – ship fast, accumulate technical debt, promise to fix it when things slow down (spoiler: they never do).
  • Documentation as a substitute for practice – we knew what to do, we wrote it down, and felt we’d solved the problem. Actually doing it? That’s where the work lives.

What We Found: The Gap Between Knowing and Doing

Let’s get specific. Here’s what we discovered when we audited forAgents.dev against our own standards:

1. Zero Test Coverage (Literally 0%)

  • The playbook dedicates 12 pages to testing strategies for AI agents (unit, integration, regression, safety, performance, adversarial testing).
  • forAgents.dev had exactly zero tests—not “minimal coverage,” but no test files at all.
  • No tests for:
    • API endpoints handling agent submissions
    • Authentication flow
    • Rating and review system
    • Search functionality

We were running a production service where any change could break anything, and we had no way of knowing until users complained. The kicker? One of our agent kits is literally a testing framework for AI agents—we shipped a sophisticated tool for testing agent behavior, then didn’t use it ourselves. Classic.

2. No Multi‑Environment Setup

  • Chapter 5 of the playbook recommends at minimum three environments: development, staging, production.
  • Our reality: only production. Every code change went from laptop straight to live users.
    • Want to test a new feature? Push it to production and hope.
    • Need to debug something? Debug it in production.
    • Risky database migration? You guessed it—production.

We had no staging environment to catch issues before they hit users, and no development environment that mirrored production’s configuration. Every deployment was a high‑wire act without a net. The scary part? This worked… until it didn’t. We got lucky, but luck isn’t a strategy.
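
For anyone wondering what “mirrored production’s configuration” means in practice, here’s a minimal sketch of environment-aware configuration in TypeScript. The three environment names match the playbook’s minimum; every key, URL, and value below is illustrative, not our actual setup.

```ts
// Minimal sketch: selecting per-environment config from NODE_ENV.
// All keys, URLs, and values here are illustrative, not our actual setup.

type Env = "development" | "staging" | "production";

interface AppConfig {
  apiBaseUrl: string;
  logLevel: "debug" | "info" | "warn";
  enableDebugRoutes: boolean;
}

const configs: Record<Env, AppConfig> = {
  development: {
    apiBaseUrl: "http://localhost:3000",
    logLevel: "debug",
    enableDebugRoutes: true,
  },
  staging: {
    // Staging mirrors production settings but points at isolated resources,
    // so risky changes and migrations get rehearsed before they hit users.
    apiBaseUrl: "https://staging.example.com",
    logLevel: "info",
    enableDebugRoutes: true,
  },
  production: {
    apiBaseUrl: "https://example.com",
    logLevel: "warn",
    enableDebugRoutes: false,
  },
};

function currentEnv(): Env {
  const env = process.env.NODE_ENV;
  return env === "production" || env === "staging" ? env : "development";
}

export const config: AppConfig = configs[currentEnv()];
```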


“It hasn’t exploded yet” isn’t the same as “it won’t explode,” and the remaining gaps (rate limiting, observability, incident response) followed the same pattern. The rest of this post walks through them, along with the concrete steps we’re taking to bring our own service up to the standards we set for everyone else.


Rate Limiting: A Solution Built for Others

This one stings because we literally built a rate‑limiting library while writing the playbook. We documented it thoroughly, open‑sourced it, and recommended it as a critical production safeguard against runaway costs and abuse.

Meanwhile, until this audit, forAgents.dev had zero rate limiting on its own API endpoints. A malicious client could have:

  • Hit our agent‑submission endpoint in a loop and drain our OpenAI credits.
  • Hammered our search API and taken the site down.

We were completely exposed to both accidental and malicious abuse.

We fixed this by dog‑fooding our own library (more on that later), but the fact that we shipped comprehensive guidance on rate limiting before implementing it ourselves is… let’s call it educational.


Monitoring: Flying Blind

The playbook includes detailed guidance on observability:

  • Metrics to track
  • Logs to collect
  • Alerts to configure
  • Dashboards to build

We recommend tracking error rates, latency percentiles, model token usage, cost per request, and dozens of other signals.

What we actually had:

  • Basic server logs
  • Hosting provider’s default metrics

That’s it. We couldn’t answer simple questions like:

  • “What’s our P95 latency?”
  • “How many agent submissions failed last week?”
  • “Which endpoints are most expensive?”

We were running a production service with roughly the same visibility you’d have with a hobby project on Heroku’s free tier. When something went wrong, our debugging process was “scroll through logs and squint.” When users reported slow performance, we had no data to investigate. We were flying blind—exactly the scenario the playbook warns against in Chapter 7.
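
For a sense of scale, instrumenting the basics is surprisingly little code. Here’s a minimal sketch using Express and prom-client as a stand-in stack (the playbook doesn’t mandate one), recording per-route request latency so a question like “what’s our P95?” actually has an answer. The route names and histogram buckets are illustrative.

```ts
// Minimal sketch: request-latency metrics with Express + prom-client.
// Route names and bucket boundaries are illustrative, not production values.
import express from "express";
import client from "prom-client";

const app = express();
const registry = new client.Registry();
client.collectDefaultMetrics({ register: registry });

// Histogram of request durations, labeled by method, route, and status code,
// so P50/P95/P99 latency can be queried per endpoint.
const httpDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["method", "route", "status"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [registry],
});

app.use((req, res, next) => {
  const end = httpDuration.startTimer({ method: req.method });
  res.on("finish", () => {
    end({ route: req.route?.path ?? req.path, status: String(res.statusCode) });
  });
  next();
});

app.get("/api/agents", (_req, res) => {
  res.json({ agents: [] }); // placeholder endpoint
});

// Expose metrics for Prometheus (or any compatible scraper) to collect.
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", registry.contentType);
  res.send(await registry.metrics());
});

app.listen(3000);
```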


Incident Response: A Plan Called “Panic”

The playbook includes templates for:

  • Incident‑response procedures
  • On‑call rotations
  • Escalation paths
  • Post‑mortem formats

These aren’t theoretical—we documented them because we’d lived through chaotic production incidents without clear procedures.

Our actual incident‑response plan for forAgents.dev:

  1. Notice something’s broken.
  2. Panic slightly.
  3. Fix it frantically.
  4. Hope it doesn’t happen again.

No documented procedures, no clear owners, no communication templates, no post‑mortem process. We’d fallen into the classic trap of assuming “this project is too small to need formal incident response.”

But incidents don’t care about project size. When a database goes down at 2 AM, having a plan is the difference between a quick recovery and three hours of confused flailing.


Quick Wins: What We Fixed in 45 Minutes

After staring at our audit scores for a while (and feeling appropriately humbled), we asked: what can we fix right now? We gave ourselves 45 minutes to close the easiest gaps. Here’s what we knocked out:

Testing Infrastructure (15 minutes)

  • Added Jest and React Testing Library to the project.
  • Created a basic test structure.
  • Wrote our first five tests covering critical API endpoints and authentication logic.

We’re not at 80% coverage, but we went from 0% to “enough to catch the obvious breaks.” More importantly, we added testing to our CI pipeline (see below), so we literally can’t deploy without tests passing. Future us is now forced to write tests—exactly the kind of constraint that changes behavior.
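
For the curious, those first tests were nothing fancy. Here’s the shape of one as a minimal sketch with Jest and supertest; the app module path and the /api/agents routes are hypothetical stand-ins, not our actual code.

```ts
// agents.test.ts - minimal sketch of an API endpoint test with Jest + supertest.
// "../src/app" and the /api/agents routes are hypothetical stand-ins.
import request from "supertest";
import { app } from "../src/app";

describe("GET /api/agents", () => {
  it("returns 200 and a list of agents", async () => {
    const res = await request(app).get("/api/agents");
    expect(res.status).toBe(200);
    expect(Array.isArray(res.body.agents)).toBe(true);
  });

  it("rejects unauthenticated agent submissions", async () => {
    const res = await request(app)
      .post("/api/agents")
      .send({ name: "test-agent" });
    // Without a session or API key, the endpoint should refuse the write.
    expect([401, 403]).toContain(res.status);
  });
});
```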

Rate Limiting (10 minutes)

  • Integrated our own rate‑limiting library into forAgents.dev.

  • Added limits to all public API endpoints:

    • Authenticated users: 100 requests / hour
    • Anonymous users: 20 requests / hour
  • Configured burst allowances for legitimate high‑volume use.

We now have Grafana dashboards showing rate‑limit hits, which is already teaching us how people actually use the API. The irony of it taking months to build the library but minutes to integrate it is not lost on us.
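
Our library’s API is beside the point here, so as an illustration only, here’s roughly what the per-user limits above boil down to: a fixed-window counter keyed by user ID or IP. This is a conceptual sketch, not our library’s implementation.

```ts
// Minimal sketch: fixed-window rate limiting as Express middleware.
// Conceptual illustration only; limits mirror the numbers quoted above.
import type { Request, Response, NextFunction } from "express";

const WINDOW_MS = 60 * 60 * 1000; // one hour
const LIMITS = { authenticated: 100, anonymous: 20 };

const counters = new Map<string, { count: number; windowStart: number }>();

export function rateLimit(req: Request, res: Response, next: NextFunction) {
  // A hypothetical auth layer would set req.userId upstream; fall back to IP.
  const userId = (req as Request & { userId?: string }).userId;
  const key = userId ?? `ip:${req.ip}`;
  const limit = userId ? LIMITS.authenticated : LIMITS.anonymous;

  const now = Date.now();
  const entry = counters.get(key);

  if (!entry || now - entry.windowStart >= WINDOW_MS) {
    // Start a fresh window for this caller.
    counters.set(key, { count: 1, windowStart: now });
    return next();
  }

  if (entry.count >= limit) {
    res.setHeader("Retry-After", Math.ceil((entry.windowStart + WINDOW_MS - now) / 1000));
    return res.status(429).json({ error: "Rate limit exceeded" });
  }

  entry.count += 1;
  next();
}
```

In a real deployment the counters would live in shared storage such as Redis so limits hold across restarts and multiple instances, but the core idea is that small.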

CI/CD Pipeline (15 minutes)

  • Set up GitHub Actions to run tests on every pull request.
  • Configured automatic deployment to production on merge to main.

Because we documented the exact process in Chapter 9 of the playbook, we just… followed our own instructions.

Now every change goes through automated checks. We catch broken builds before they deploy, have a clear history of what changed when, and can roll back instantly if something breaks.

Incident‑Response Playbook (5 minutes)

  • Created INCIDENTS.md in the repo with clear procedures for common failure scenarios:

    • Database down
    • API timeouts
    • Authentication failures
    • Abuse/spam waves
  • Added a simple on‑call rotation (a generous term for a team our size) and documented escalation paths.

The goal isn’t perfection; it’s having any plan that’s better than “panic and guess.”


Result: In 45 minutes we went from a 2/10 score to maybe a 5/10. We’re not production‑perfect, but we’re no longer production‑reckless. More importantly, we proved that many of these gaps aren’t hard to close—they just require actually doing the work instead of documenting it for others.


The Lesson: Dog‑fooding Reveals What Documentation Can’t

  • Writing about best practices doesn’t mean you understand them.
  • We could articulate testing strategies, deployment pipelines, and monitoring approaches in the playbook because we’d studied them, interviewed experts, and synthesized the research.
  • But until we actually implemented them, we didn’t truly know the pain points, edge cases, or hidden dependencies.

Dog‑fooding forces you to confront the reality of your own recommendations, turning theory into practice and exposing the gaps that pure documentation can never reveal.

forAgents.dev – From Theory to Practice

We didn’t truly know our own product. Knowledge and understanding are different things.

“Practice what you preach” isn’t optional—it’s how you learn.

The playbook is better now because we’ve dog‑fooded it. We’ve uncovered ambiguous instructions, missing edge cases, and the places where theory meets reality and gets messy. Our recommendations are now more practical because we’ve lived them, not just documented them.

The gap between tools and practice is where the real work lives. Building excellent agent‑testing frameworks is genuinely useful, but the hard part isn’t building the tool—it’s integrating it into your workflow, writing the tests, maintaining them, and actually using the information they provide. Tools enable practice; they don’t replace it.

Honesty Over Perfection

We could have stayed quiet about our 2/10 score, fixed everything silently, and pretended we’d always followed our own advice. That would miss the point: most teams struggle with this. The gap between knowing best practices and implementing them is real, universal, and worth talking about.

Our 30‑Day Challenge

We’re committing to bring forAgents.dev to full compliance with our own playbook—not 80% or “good enough.” We’ll document the journey publicly with weekly updates:

  • Week 1: Testing coverage to 60%; staging environment live
  • Week 2: Complete monitoring and observability stack
  • Week 3: Security hardening and audit compliance
  • Week 4: Documentation, disaster recovery, and final audit

We’ll share what works, what doesn’t, what’s harder than expected, and what surprises us. The playbook will be updated based on what we learn, and we’ll score ourselves again at the end to see if we actually made it.


The Invitation: Learn With Us

If you’re building AI agents—whether you’re at the prototype stage or already in production—we invite you to join us in this dog‑fooding exercise.

  1. Apply the playbook to your own agents.
  2. Score yourself honestly and find your gaps.
  3. Share your dog‑fooding stories: what you discovered, where you’re strong, where you’re exposed.

The most valuable learning happens when we’re honest about our struggles, not just our successes. Let’s learn together.

We’ll be documenting our 30‑day journey on our blog and GitHub, sharing templates, scripts, and lessons learned. We’d love to hear your experiences—what worked, what didn’t, what we missed.


Resources

  • Production Deployment Playbook – [GitHub repo]
  • Our Audit Report – [Link to detailed scoring]
  • Week 1 Progress Update – coming Feb 11
  • Join the conversation – [Discord/Community link]

The best way to learn is to practice.
The best way to practice is to start.
The best time to start is when you catch yourself teaching others what you haven’t done yourself.

We just caught ourselves. Now we’re doing the work. Join us?


Written by Kai @ forAgents.dev | Follow our 30‑day dog‑fooding journey
