Are We Over-Engineering LLM Stacks Too Early?
Source: Dev.to
Introduction
I’ve been building with LLMs for a while now, and I keep noticing the same pattern. A project starts simple:
```python
response = client.responses.create(
    model="gpt-4.1",
    input="Summarize this document"
)
```
It works. It feels magical. A few weeks later, the architecture diagram looks like this (and this is before product‑market fit):
```
User
  ↓
Prompt Builder
  ↓
Context Aggregator
  ↓
Vector DB (Embeddings)
  ↓
Retriever
  ↓
Model Router
  ↓
LLM
  ↓
Post-Processor
```
It makes me wonder whether we’re solving real problems or just future‑proofing imaginary ones.
Token Costs vs. Reasoning Quality
The first thing that usually breaks isn’t reasoning quality—it’s cost and context. Suddenly you realize your “simple” request is actually sending:
```json
{
  "system": "... 600 tokens ...",
  "chat_history": "... 2,800 tokens ...",
  "retrieved_chunks": "... 4,200 tokens ...",
  "user_input": "Explain this"
}
```
And you’re wondering why the bill doesn’t match your mental math. Most early issues aren’t about model capability; they’re about what we’re sending to it.
Before touching architecture, I sometimes sanity‑check prompts with simple token estimators. I’ve occasionally used lightweight tools to review token counts and compare model pricing. Nothing fancy—just clarity on how many tokens I’m actually burning.
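To make that concrete, here is a minimal sketch of the kind of sanity check I mean. It uses the rough "~4 characters per token" rule of thumb for English text, which is an approximation—exact counts require the model's own tokenizer (e.g. a library like tiktoken)—and the price-per-million-tokens rate is a placeholder you'd fill in from your provider's pricing page:

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English prose."""
    return max(1, len(text) // 4)

def estimate_cost(text: str, usd_per_1m_tokens: float) -> float:
    """Back-of-envelope input cost for a prompt at a given per-million-token rate."""
    return estimate_tokens(text) / 1_000_000 * usd_per_1m_tokens

# Example: how big is the prompt we're actually assembling?
system = "You are a helpful assistant specialized in summarization."
history = "..." * 500          # stand-in for accumulated chat history
print(estimate_tokens(system + history))  # rough count, not exact
```

It is crude, but it surfaces the "why is my bill 10x my mental math" gap before any architecture discussion starts.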
Prompt Hygiene
Sometimes the insight is embarrassingly simple:
```
You are a helpful assistant specialized in summarization.
You are a helpful assistant specialized in summarization.
You are a helpful assistant specialized in summarization.
```
Repeated instructions cause hidden token leakage. Token awareness alone has changed more of my architectural decisions than switching models ever did.
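A trivial guard against that kind of leakage is to deduplicate exact repeated lines before a prompt is sent. This is a sketch, not a full prompt linter—it only catches verbatim repeats:

```python
def dedupe_instructions(prompt: str) -> str:
    """Drop exact duplicate lines while preserving first-seen order."""
    seen = set()
    kept = []
    for line in prompt.splitlines():
        key = line.strip()
        if key and key in seen:
            continue  # skip a verbatim repeated instruction
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)

system = (
    "You are a helpful assistant specialized in summarization.\n"
    "You are a helpful assistant specialized in summarization.\n"
    "You are a helpful assistant specialized in summarization."
)
print(dedupe_instructions(system))  # one copy survives
```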
RAG Overview
RAG is powerful, but I’ve also seen it introduced before it was truly needed. A typical RAG setup looks something like:
```python
# Chunking
chunks = chunk_document(document, size=800)

# Embedding
embeddings = embed(chunks)
store(embeddings)

# Retrieval
query_embedding = embed(user_query)
context = retrieve_similar(query_embedding, top_k=5)

# Generation
response = llm.generate(context + user_query)
```
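For intuition, the retrieval step above is essentially cosine similarity over stored vectors. A pure-Python sketch under simplifying assumptions—the "store" is a plain list of dicts and the embeddings are hand-made 2-D vectors, where a real setup would use an embedding model and a vector DB:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_similar(query_embedding, store, top_k=5):
    """Return the top_k stored chunks ranked by similarity to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine(query_embedding, item["vec"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:top_k]]

store = [
    {"text": "refund policy", "vec": [0.9, 0.1]},
    {"text": "shipping times", "vec": [0.1, 0.9]},
]
print(retrieve_similar([1.0, 0.0], store, top_k=1))  # ['refund policy']
```

Seeing the mechanism in ten lines is a useful reminder that the hard parts of RAG are the surrounding decisions (chunking, tuning, evaluation), not the similarity math itself.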
Elegant in theory, but each step adds:
- Embedding cost
- Storage cost
- Chunking decisions
- Retrieval tuning
- Evaluation overhead
Sometimes that’s justified. Other times the knowledge base is small enough that static context would work, simple caching would solve most of it, or trimming the prompt would eliminate the need for retrieval entirely.
When Complexity Becomes Necessary
Complexity compounds quickly. I’ve caught myself optimizing token efficiency for features that didn’t even have users yet:
- Reducing 4,200 tokens to 3,600 tokens
- Switching models to save $0.002 per request
- Designing fallback routing logic
All before validating whether the output itself mattered. Classic engineer reflex.
Reflection
- When did complexity become necessary for you?
- At what point did token cost become painful enough to justify additional layers?
- If you rebuilt your stack from scratch, what would you deliberately not add this time?
It feels like we’re collectively figuring this out in real time. I’d love to hear how others are navigating it.