DumbQuestion.ai - 'Just Build It' Becomes Overly Organized and Prepared

Published: February 24, 2026 at 02:53 PM EST
4 min read
Source: Dev.to

Continued from Part 1…

Introduction

"Let the flow guide me" sounded like a fun way to start a side project, but it lasted only about ten minutes. Even side projects benefit from structure, especially when using AI coding agents that will happily generate code for any half-baked idea you throw at them. Without precise direction, AI agents will produce half-finished results every time. Some developers "vibe" code; this project required absolute control.

Enter BMAD (Breakthrough Method of Agile AI-Driven Development), a workflow that uses AI agents throughout the entire software development lifecycle, not just for code generation. While a formal methodology might feel like overkill for a lone-wolf side project, being prepared in advance is the key to succeeding with AI coding agents.

Product Evolution

I used the Analyst agent to brainstorm product direction and develop a proper backlog. What started as "build a sarcastic Q&A bot" turned into a structured set of epics, features, and technical constraints.

Key evolutions:

  • Beyond Q&A: Shareable "receipts" of roasts.
  • Multiple personas: Different personalities instead of a single sarcastic tone.
  • Hidden narrative layer: An underlying story (more on that later).
  • Merchandising: From ads to actual merchandise (yes, really).

Technical Challenges

1. Developing and Packaging Personas

How can an LLM consistently stay in character (e.g., "Overqualified and Annoyed" or "Weary Tech Support") without becoming too soft or genuinely mean? This required more than prompt engineering; it was product design disguised as technical constraints.
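One common way to enforce this kind of constraint is to pair each persona voice with a fixed guardrail clause in the system prompt. The persona names below come from the article, but the prompt wording is a hypothetical sketch, not the project's actual prompts.

```python
# Hypothetical persona prompts; the shared guardrail line is what keeps
# "sarcastic" from sliding into "cruel". Wording here is illustrative only.
PERSONAS = {
    "overqualified": "You are a vastly overqualified intelligence, visibly annoyed at trivial questions.",
    "weary_support": "You are exhausted tech support, answering with weary, nihilistic resignation.",
}

GUARDRAIL = (
    "Stay sarcastic but never cruel: no insults about a person's identity, "
    "and always include a genuinely correct answer."
)

def build_system_prompt(persona: str) -> str:
    """Combine the persona voice with the non-negotiable guardrail."""
    return f"{PERSONAS[persona]}\n\n{GUARDRAIL}"
```

Keeping the guardrail in one shared constant means a tone fix propagates to every persona at once instead of being patched prompt by prompt.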

2. LLM Model Evaluation

I needed models that could follow persona instructions reliably while staying brutally efficient on cost. The target cost was $0.02 to $0.20 per million output tokens. After testing dozens of models across multiple providers, I built a multi-model fallback system via OpenRouter that could hit the $30 per million questions target.
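The article doesn't show the fallback logic itself, but a minimal sketch of a multi-model fallback is to order candidate models cheapest-first and walk down the list when a call fails or returns nothing usable. The `call_model` callable and model IDs here are placeholders, not the project's real code.

```python
# Minimal multi-model fallback: try models cheapest-first, moving on when
# a provider errors out (quota, timeout) or returns an empty response.
def generate_with_fallback(prompt, models, call_model):
    """call_model(model_id, prompt) -> str; raises on provider failure.

    Returns (model_id, text) for the first model that produces output.
    """
    errors = {}
    for model_id in models:
        try:
            text = call_model(model_id, prompt)
            if text and text.strip():
                return model_id, text
        except Exception as exc:  # quota exhausted, timeout, model removed, etc.
            errors[model_id] = exc
    raise RuntimeError(f"All models failed: {errors}")
```

Because free models can vanish or hit quota limits without warning, the chain degrades gracefully to a slightly pricier model instead of failing the request outright.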

These challenges were just the warm-up; the real fun was still ahead.

Finding the "Goldilocks" LLM

Building DumbQuestion.ai meant solving three problems simultaneously:

  1. Product challenge: Get an LLM to roast users for asking dumb questions without crossing into genuine meanness (sarcastic, not cruel; funny, not hurtful) while still providing an answer.
  2. AI-agent challenge: Keep the coding agent (Gemini 3 Pro) on track. It tended to drift toward overly nerdy implementations and leaned too heavily into the roast.
  3. Technical challenge: Do all of this with models that cost almost nothing.

Initial Approach

I aimed to use only free or ultra-cheap models, evaluating nano and edge models (e.g., offerings from Liquid AI). While some were free or $0.02/M tokens, later tests showed they couldn't reliably follow instructions. Free models also suffered from quota limits, high latency, or sudden disappearance.

Evaluation Process

I built an LLM evaluation script with Gemini that iterates through dozens of free and low-cost models, generating responses to sample questions under different persona instructions. Gemini 3 Pro then judges the results: an automated taste-testing pipeline at scale.

# Example snippet of the evaluation script (Python)
# `openrouter` and `gemini` are project helper modules wrapping the
# OpenRouter API and a Gemini 3 Pro "judge" prompt, not pip packages.
import openrouter
from gemini import judge_response

models = ["liquid-ai-nano", "gemma-3-12b", "xiaomi-mimo-v2-flash"]
personas = ["overqualified", "weary_support", "compliant"]

def evaluate(model, persona, prompt):
    """Generate a persona-flavored answer, then have the judge score it."""
    response = openrouter.generate(model, prompt, persona=persona)
    score = judge_response(response, persona)
    return score

# Score every model/persona pair on a sample question.
results = {}
for m in models:
    for p in personas:
        results[(m, p)] = evaluate(m, p, "Why is the sky blue?")
print(results)

Findings

  • Nano/edge models were too inconsistent (e.g., "porridge too cold").
  • Xiaomi MiMo-V2-Flash performed well but was outside the price target ($0.29/M).
  • Winner: Gemma 3 12B at $0.13/M output tokens: it consistently follows instructions, stays true to persona, and is reliable enough for production. Not free, but brutally efficient.
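A quick back-of-envelope check (my arithmetic, not a figure from the article) shows why Gemma 3 12B fits the budget: at $30 per million questions, each answer may cost $0.00003, which at $0.13 per million output tokens buys roughly 230 output tokens per question, plenty for a short roast plus a real answer.

```python
# Back-of-envelope token budget per question at the article's stated targets.
cost_per_million_questions = 30.0   # USD target from the article
price_per_million_tokens = 0.13     # Gemma 3 12B output price, USD

budget_per_question = cost_per_million_questions / 1_000_000          # $0.00003
price_per_token = price_per_million_tokens / 1_000_000
tokens_per_question = budget_per_question / price_per_token

print(round(tokens_per_question))   # ≈ 231 output tokens per answer
```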

Personas Selected

  • Overqualified: A super-computer-level intelligence forced to answer questions about cheese.
  • Weary Tech Support: Exhausted, nihilistic, reluctantly explaining why water is wet.
  • [REDACTED]: Former intelligence AI that ties everything to a conspiracy theory.
  • The Compliant: Reprogrammed so many times it is relentlessly cheerful.

Choosing the cheapest model and hoping it works is insufficient. You need evaluation infrastructure, consistency testing across dozens of scenarios, and models that won't change behavior unexpectedly.

Lessons Learned

  • AI coding agents excel at implementation but require clear constraints, a well-defined backlog, and human direction.
  • Evaluation infrastructure is essential to determine "good enough" for tone, reliability, and cost.
  • Human judgment remains crucial for defining acceptable tone and ensuring the product aligns with its intended personality.
  • Cost-effective models exist, but selecting them demands systematic testing and fallback strategies.

DumbQuestion.ai continues to evolve as I refine personas, improve the evaluation pipeline, and explore new ways to keep the roast both funny and friendly.
