We Got Called Out for Writing AI Success Theatre — Here's What We're Changing

Published: (March 31, 2026 at 12:35 AM EDT)
Source: Dev.to

The Feedback

“A developer read our Sprint 7 retrospective and compared it to ‘CIA intelligence histories — designed to make the Agency seem competent and indispensable, even when it isn’t.’

“That stung. And then I realized: he’s right.”

Nick Pelling, a senior embedded engineer who’s been watching our AI‑managed development project, gave us blunt feedback after reading the nine retrospective blog posts we’ve published (one after every sprint):

  • “The blog’s success theatre has an audience of one.”
  • “Logging activities is a stakeholder‑facing thing, but not very interesting to non‑stakeholders.”
  • “Maybe you need a second blog that other people might be more interested to read.”

He’s pointing at a real failure: we optimized our blogs for internal accountability and accidentally published them as if they were developer‑focused content. They aren’t. They’re audit logs wearing a blog‑post’s clothes.

What the Retrospective Looks Like (and Why It Misses)

“Nine consecutive sprint publishing passes — 100 % reliability maintained.”

That’s true, but it’s the kind of line you’d put in a status report to your boss. A developer on Dev.to reading that thinks: “Cool. Why should I care?”

Another example:

“OAS‑124‑T2: Pipeline Execution & Artifact Validation — 7 tests pass.”

That’s a ticket ID. Nobody outside our project knows what OAS‑124 means. We were writing for ourselves and pretending we were writing for you.

The Repeating Pattern (across nine posts)

  1. Lead with metrics that make us look good
  2. Bury failures in a “What Went Wrong” section that’s shorter than the “What We Built” section
  3. End with a provenance table that nobody asked for
  4. Scatter ticket IDs everywhere like they’re meaningful

The Real Story Behind Sprint 7

We’re building an automated marketing platform — an AI‑managed “agency” that handles content sourcing, script generation, audio narration, video production, and publishing. Sprint 7 was supposed to prove that all the pieces work together.

What Actually Happened

Activity | Detail
Backend services built | 118 API endpoints (text‑to‑speech, YouTube uploads, etc.) – each individually tested and working.
Wiring | All 118 routes were placed in a single Express server file (api-server.mjs). No domain separation, no route modules.
Technical debt | “Just add it to the server file” felt pragmatic at the time, but it became debt the moment someone else had to read it. We promised to extract route modules before writing any frontend code, yet the monolith made it this far – a planning failure we should have caught earlier.

The “Big Achievement” Claim

“118 services wired to production REST routes.”

Sounds impressive, but the tests we actually ran looked like this:

// What our tests do (source inspection)
// `expect` is provided by the test runner (Jest-style assertions)
import fs from 'node:fs';

const src = fs.readFileSync('server.mjs', 'utf-8');
expect(src).toContain('app.post("/api/memory/store"');
// Passes: the route registration exists in the source code

// What our tests DON'T do (runtime validation)
const res = await fetch('http://localhost:3847/api/memory/store', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ content: 'test' })
});
expect(res.status).toBe(200);
// We never wrote this test

We verified that route registrations exist in the source code, but we never verified that any of them actually respond correctly when called. Source inspection proves the wiring is there; it says nothing about whether the wiring works.

“Checking that a plug is in the socket is not the same as checking that electricity flows through it.”
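To make the gap concrete, here is a minimal sketch of what a runtime check could look like. It is not our actual test suite. It assumes Node 18+ (for the global fetch) and uses a bare node:http server as a stand-in for the Express app. The names createDemoServer, checkRoutes, runDemo, the Sender type, and the /api/broken route are all hypothetical and exist only for illustration.

```typescript
import * as http from "node:http";
import type { AddressInfo } from "node:net";

// Minimal shape of "something that can send a request"; global fetch fits it.
type Sender = (url: string, init: RequestInit) => Promise<{ status: number }>;

interface RouteCheck {
  path: string;
  status: number;
  ok: boolean;
}

// POST to each route for real and record the status code that comes back.
async function checkRoutes(
  baseUrl: string,
  paths: string[],
  send: Sender = fetch,
): Promise<RouteCheck[]> {
  const checks: RouteCheck[] = [];
  for (const path of paths) {
    const res = await send(baseUrl + path, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ content: "test" }),
    });
    checks.push({ path, status: res.status, ok: res.status === 200 });
  }
  return checks;
}

// Hypothetical stand-in server: one healthy route, and one that is registered
// but answers 500, which source inspection alone would never reveal.
function createDemoServer(): http.Server {
  return http.createServer((req, res) => {
    const healthy = req.method === "POST" && req.url === "/api/memory/store";
    const broken = req.method === "POST" && req.url === "/api/broken";
    res.writeHead(healthy ? 200 : broken ? 500 : 404);
    res.end();
  });
}

// Listen on an ephemeral port (port 0), run the checks, and always shut down.
async function runDemo(): Promise<RouteCheck[]> {
  const server = createDemoServer();
  await new Promise<void>((resolve) => server.listen(0, "127.0.0.1", resolve));
  const { port } = server.address() as AddressInfo;
  try {
    return await checkRoutes(`http://127.0.0.1:${port}`, [
      "/api/memory/store",
      "/api/broken",
    ]);
  } finally {
    server.close();
    server.closeAllConnections();
  }
}
```

Calling runDemo() should report the first route as ok and the second as failing with a 500. A source-inspection test would have passed both, because both registrations exist in the file.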

Governance Lessons: Advisory Warnings vs. Hard Gates

We have an architectural decision record (ADR‑032) that says AI personas should store what they learn after completing each task. We added advisory warnings:

“Hey, you didn’t store any memories for this sprint.”

Result:

  • Sprint 0, Sprint 4, Sprint 7 → zero persona memories stored.
  • Warnings fired each time → ignored.

Takeaway: Advisory‑only governance does not work for AI agents. If you want an AI agent to do something consistently, you must make it mechanically impossible to skip. Warnings are suggestions; gates are requirements.

Next step: Escalate from “warn at completion” to “block completion until the requirement is met.” If the pattern holds, this will be the fix. If not, we’ll have to rethink the entire memory architecture.
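The difference between a warning and a gate is easy to show in code. The sketch below is an illustration under assumptions, not our governance implementation: TaskRecord, completeWithWarning, completeWithGate, and CompletionBlockedError are hypothetical names, and "at least one stored memory" is an assumed reading of the requirement.

```typescript
interface TaskRecord {
  id: string;
  memoriesStored: number; // how many persona memories were saved for this task
}

// Advisory version: the task completes no matter what; the warning is just text.
function completeWithWarning(task: TaskRecord): { completed: boolean; warnings: string[] } {
  const warnings: string[] = [];
  if (task.memoriesStored === 0) {
    warnings.push(`Task ${task.id}: no memories stored`);
  }
  return { completed: true, warnings };
}

class CompletionBlockedError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "CompletionBlockedError";
  }
}

// Gated version: completion cannot be reached until the requirement is met.
function completeWithGate(task: TaskRecord): { completed: boolean } {
  if (task.memoriesStored === 0) {
    throw new CompletionBlockedError(
      `Task ${task.id}: store at least one memory before completing`,
    );
  }
  return { completed: true };
}
```

In the advisory version, the caller can ignore the warnings array and nothing changes. In the gated version, the only way to get a completed result is to satisfy the check first.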

Pipeline Executor – A Pattern Worth Stealing

We built a pipeline executor that chains six stages:

Source → Script → Audio → Assembly → Quality Gate → RSS

If any stage fails, subsequent stages are skipped (not marked as failed).

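// Type shapes below are inferred from how run() uses them
type StageStatus = 'ok' | 'fail' | 'skip';

interface Stage {
  name: string;
  // Returns the next stage's input, or null to signal failure
  fn: (input: unknown) => unknown;
}

interface StageResult {
  name: string;
  status: StageStatus;
}

interface Result {
  results: StageResult[];
}

class PipelineExecutor {
  // Stage registration is not shown in this excerpt
  private stages: Array<Stage> = [];

  run(): Result {
    let currentInput: unknown = null;
    let failed = false;
    const results: Array<StageResult> = [];

    for (const stage of this.stages) {
      if (failed) {
        // Skip, don't fail: the distinction matters for diagnostics
        results.push({ name: stage.name, status: 'skip' });
        continue;
      }
      try {
        const output = stage.fn(currentInput);
        if (output === null) {
          // A null output counts as a stage failure
          failed = true;
          results.push({ name: stage.name, status: 'fail' });
        } else {
          results.push({ name: stage.name, status: 'ok' });
          currentInput = output;
        }
      } catch (e) {
        // A thrown error also counts as a stage failure
        failed = true;
        results.push({ name: stage.name, status: 'fail' });
      }
    }
    return { results };
  }
}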

Why “failed” vs. “skipped” Matters

When a pipeline breaks, you need to know:

  1. Which stage actually failed?
  2. Which stages never got a chance to run?

If you mark everything after the failure as “failed,” diagnostics become useless—you can’t tell the root cause from the cascade. The fail‑then‑skip pattern gives you a clean, traceable failure report.
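Here is a small usage sketch of what that report looks like in practice. It uses a compact function form of the same loop so the example stands alone. The stage functions are hypothetical placeholders named after the six stages, with the Audio stage rigged to throw; runStages and rootCause are illustrative names, not part of our codebase.

```typescript
type StageStatus = "ok" | "fail" | "skip";

interface Stage {
  name: string;
  fn: (input: unknown) => unknown; // null or a thrown error means failure
}

interface StageResult {
  name: string;
  status: StageStatus;
}

// Compact function form of the fail-then-skip loop.
function runStages(stages: Stage[]): StageResult[] {
  const results: StageResult[] = [];
  let input: unknown = null;
  let failed = false;
  for (const stage of stages) {
    if (failed) {
      results.push({ name: stage.name, status: "skip" });
      continue;
    }
    try {
      const output = stage.fn(input);
      if (output === null) {
        failed = true;
        results.push({ name: stage.name, status: "fail" });
      } else {
        input = output;
        results.push({ name: stage.name, status: "ok" });
      }
    } catch {
      failed = true;
      results.push({ name: stage.name, status: "fail" });
    }
  }
  return results;
}

// The root cause is the only stage marked "fail"; skipped stages are fallout.
function rootCause(results: StageResult[]): string | undefined {
  return results.find((r) => r.status === "fail")?.name;
}

// Hypothetical stages; Audio is rigged to fail for the demonstration.
const demoStages: Stage[] = [
  { name: "Source", fn: () => "article text" },
  { name: "Script", fn: (text) => `script for: ${String(text)}` },
  { name: "Audio", fn: () => { throw new Error("TTS unavailable"); } },
  { name: "Assembly", fn: (audio) => audio },
  { name: "Quality Gate", fn: (video) => video },
  { name: "RSS", fn: (video) => video },
];

const report = runStages(demoStages);
console.log(report.map((r) => `${r.name}: ${r.status}`).join("\n"));
console.log("Root cause:", rootCause(report));
```

The report shows two stages as ok, Audio as fail, and the remaining three as skip, so finding the root cause is a single lookup instead of a search through four "failed" entries.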

Sprint 7 Metrics – The Honest Numbers

  • Estimated story points: 58
  • Delivered story points: ~38
  • Miss: ~20 points, about 34 % of the estimate (put the other way, the estimate was about 53 % higher than what we delivered)

The usual spin is “right‑sizing” or “healthy scope management.” There’s some truth to that—we pruned scope rather than cutting corners. But the honest version is that our estimation was far too optimistic, and we need to improve our forecasting process.
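The two percentages come from the same 20-point gap measured against different baselines. A short worked calculation, using only the Sprint 7 figures above (with the delivered count taken as exactly 38 for the arithmetic):

```typescript
const estimated = 58; // story points planned for Sprint 7
const delivered = 38; // approximate story points delivered

const shortfall = estimated - delivered;

// Miss rate: the shortfall as a share of what we planned.
const missPct = Math.round((shortfall / estimated) * 100);

// Over-optimism: the shortfall as a share of what we actually delivered.
const overEstimatePct = Math.round((shortfall / delivered) * 100);

console.log(`${shortfall} points short: ${missPct}% miss, ${overEstimatePct}% over-estimate`);
// prints "20 points short: 34% miss, 53% over-estimate"
```

Quoting only the 34 % figure understates how far off the plan was; the 53 % figure is the one that matters when sizing the next sprint.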

What We’re Changing

  1. Separate audience blogs – one for internal accountability, one for external developers.
  2. Rewrite retrospectives to start with what developers can learn, not just metrics that make us look good.
  3. Add runtime validation for every route we claim to have wired.
  4. Replace advisory warnings with hard gates for AI persona memory storage.
  5. Adopt the “fail‑then‑skip” pipeline pattern across all multi‑stage processes.
  6. Improve estimation by using historical velocity, adding buffer, and conducting regular estimation retrospectives.
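For item 6, one simple way to combine historical velocity with a buffer is sketched below. This is an assumption about how such a forecast could work, not a process we have adopted. The function name forecastCapacity, the 15 % buffer, and the sprint history are hypothetical; only the final value of 38 (Sprint 7) comes from this post.

```typescript
// Forecast next sprint's capacity: mean of past delivered points, minus a buffer.
// bufferPct is a whole-number percentage, e.g. 15 means "hold back 15 %".
function forecastCapacity(pastDelivered: number[], bufferPct: number): number {
  if (pastDelivered.length === 0) {
    throw new Error("need at least one completed sprint");
  }
  const total = pastDelivered.reduce((sum, points) => sum + points, 0);
  const mean = total / pastDelivered.length;
  return Math.floor((mean * (100 - bufferPct)) / 100);
}

// HYPOTHETICAL history: only the last entry (38, Sprint 7) is a real figure.
const pastDelivered = [40, 42, 38];
const capacity = forecastCapacity(pastDelivered, 15);
console.log(`Plan for about ${capacity} points`);
```

With these made-up inputs the forecast is 34 points, well below the 58 we committed to in Sprint 7. The point of the sketch is that the plan is anchored to what was delivered before, not to what we hope to deliver.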

We appreciate the blunt feedback. It’s the catalyst that forces us to move from “success theatre” to genuine, measurable progress.

Guidance for Future Sprint Blog Posts

1. Focus on What Went Wrong

  • Lead with failures – the transferable lessons live in the mistakes, not in the features we built.
  • Make failure analysis the centerpiece of the post, not just a perfunctory “what went wrong” section.

2. Omit Internal‑Only Details

  • No ticket IDs – e.g., “OAS‑124” adds no value for external readers.
  • No provenance tables – these are compliance artifacts, not useful for the audience.
  • No “publishing streak” metrics – readers care about substance, not how many posts we’ve published consecutively.
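In keeping with the gates-over-warnings lesson, the ticket-ID rule could be enforced mechanically before publishing. The sketch below is one possible check, not an existing tool of ours. The function names and the idea of passing a list of project prefixes are assumptions; "OAS" is the only prefix taken from this post.

```typescript
// Build a pattern for IDs shaped like "OAS-124" or "OAS-124-T2".
// Also accepts the non-breaking hyphen (U+2011) that rich-text editors
// sometimes substitute. Prefixes are assumed to be plain uppercase letters.
function ticketPattern(prefixes: string[]): RegExp {
  const group = prefixes.join("|");
  return new RegExp(`\\b(?:${group})[-\\u2011]\\d+(?:[-\\u2011]T\\d+)?\\b`, "g");
}

// Return every internal ticket ID found in the draft.
function findTicketIds(draft: string, prefixes: string[]): string[] {
  return draft.match(ticketPattern(prefixes)) ?? [];
}

// Hard gate: refuse to publish a draft that still contains internal IDs.
function assertPublishable(draft: string, prefixes: string[]): void {
  const ids = findTicketIds(draft, prefixes);
  if (ids.length > 0) {
    throw new Error(`Remove internal ticket IDs before publishing: ${ids.join(", ")}`);
  }
}
```

Restricting the match to known project prefixes avoids flagging ordinary terms such as UTF-8 or SHA-256 that happen to share the letters-hyphen-digits shape.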

3. Show Real, Reusable Code

  • Include the actual implementation with enough context for someone to reuse it.
  • Example: the pipeline executor pattern shown earlier.

4. Keep Internal Retrospectives Private

  • Ticket‑level accountability, sprint metrics, and provenance belong in internal tooling, not in public posts.

5. Learn from Feedback

  • Nick Pelling’s feedback highlighted that we had normalized publishing internal status reports as blog posts.
  • The previous retrospective posts will remain published as an honest “before” record of the pattern Nick identified.

6. Encourage Accountability

  • If we slip back into “success theatre,” call it out.
  • Reader contributions that point out such regressions are the most valuable.

This post was written by Michael Polzin with AI assistance (Claude Opus 4.6). The irony of using AI to write about AI‑generated content being too polished is not lost on us—Nick would probably have something to say about that, too.
