Using Code Agents Effectively
Source: Dev.to
Overview
Every time LLMs get an “upgrade” they feel like magic – the first impressions are always strong.
That’s how I felt with code agents: suddenly the models were hyper‑context‑aware to my exact system and could iterate on problems until they produced a working solution.
The craziest part? I didn’t even have to understand the solution because the agent had already implemented it.
But, as it always goes, playing with this new capability quickly reveals hard limits. Beyond toy projects the agents begin to cut major corners, and when you try to edit a large repo they often fail completely.
The latest models (GPT‑5‑Codex, Sonnet 4.5) are well into the territory of being as capable as a new engineer. However, they need to be directed with extreme nuance and can’t just be let loose on problems like an eager new grad. The models are smart enough to get just about anything done, but not without a lot of teamwork.
Below I detail the strategies I’m using to get good results with the current batch of cutting‑edge LLMs and tools.
1. Context Is Worth More Than Gold
Even with large token windows, every token is poison that can ruin your results. The problem comes from three major angles:
Price
- With current Anthropic pricing a single prompt can cost over $30.
Task‑overload
- The model is thinking about its system prompt, every `# TODO` it finds, all of your instructions, etc.
- All of these compound to confuse the model.
Retrieval
- Every model handles token limits differently; in reality none of them truly hold hundreds of thousands of tokens in their “head”.
- Retrieval is probably the most important area of advancement for LLMs right now and is far from a solved problem.
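To get a feel for how quickly tokens become dollars, here is a back-of-the-envelope estimate. The per-million-token prices below are illustrative assumptions, not current Anthropic pricing; a single turn over a bloated context already costs a few dollars, and an agent session is many turns:

```python
def prompt_cost_usd(input_tokens: int, output_tokens: int,
                    usd_per_m_input: float = 15.0,
                    usd_per_m_output: float = 75.0) -> float:
    """Rough cost of one agent turn at assumed per-million-token prices."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

# A bloated context: 150k input tokens plus 10k generated tokens.
print(round(prompt_cost_usd(150_000, 10_000), 2))  # one turn, not one session
```

Multiply that by a few dozen turns in a long agentic session and the $30 prompts stop looking surprising.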
2. Agent‑of‑Agents: A “Crazy Hack” That Works
Having your primary agent call other agents gives you three big benefits:
- Higher‑quality answers – a fresh agent tackles a sub‑task with a clean context.
- Smaller main‑agent context – keeps the primary agent’s token window lean.
- Cost savings – smaller prompts = cheaper calls.
Typical workflow for code‑pairing
Two tasks happen a lot when doing code pairing at work with an agent:
- Running build systems
- Reading proprietary docs via MCP
Both require only a small amount of information, yet the tools dump massive token payloads:
- Average docs page ≈ 10 k tokens
- A small `make` build can be thousands of tokens
Solution: Create a fresh agent and tell it, e.g.:
“Run `make build`. I edited `xxx.hpp`. Let me know if there are any errors.”
The agent then returns concise feedback: either “Your change worked” or just the relevant error messages.
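The sub-agent's job here is essentially compression: turn thousands of tokens of build output into a few lines the main agent can act on. A minimal sketch of that filtering step, assuming a generic heuristic for what counts as an error line (not any specific tool's format):

```python
import re

def summarize_build(raw_output: str, max_errors: int = 5) -> str:
    """Condense a noisy build log into concise feedback for the main agent."""
    errors = [line.strip() for line in raw_output.splitlines()
              if re.search(r"\b(error|failed)\b", line, re.IGNORECASE)]
    if not errors:
        return "Your change worked: build completed with no errors."
    shown = "\n".join(errors[:max_errors])
    return f"Build failed with {len(errors)} error line(s):\n{shown}"

log = """\
g++ -c xxx.cpp
xxx.hpp:42:10: error: 'Foo' was not declared in this scope
make: *** [Makefile:12: xxx.o] Error 1
"""
print(summarize_build(log))
```

The main agent only ever sees the few returned lines, not the full log.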
3. Model‑Specific Strengths
- Claude Code v2.0 is particularly good at this pattern.
- I used to manage the cycle myself and start new chats frequently, but now I can keep the same chat going.
“It’s more vibes than strict rules, but I tend to compact after a new feature is implemented and tests are passing – pretty much the same milestones as when I would make a code commit.”
4. Working in Large, Proprietary Repos
- You must be extremely explicit with tasking, yet you can’t overload the model with too much info.
- Dumping all docs and folders into context lobotomizes the model.
Practical tips
- Tell the agent exactly where to look and which path to take for implementation.
- If you don’t know those answers, use a few different agents to gather the relevant info and help you craft a precise prompt for the execution agent.
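One way to think about that last step: the scout agents' findings get distilled into an explicit, minimal prompt for the execution agent. A sketch of that assembly, where the task, file paths, and approach are hypothetical examples of what the scouts might report:

```python
def build_execution_prompt(task: str, files: list[str], approach: str) -> str:
    """Assemble an explicit, minimal prompt from what scout agents found.

    Keeping the prompt to just these facts avoids flooding the execution
    agent's context with the rest of the repo.
    """
    file_list = "\n".join(f"- {path}" for path in files)
    return (
        f"Task: {task}\n"
        f"Only touch these files:\n{file_list}\n"
        f"Implementation approach: {approach}\n"
        "Do not read other parts of the repo unless a build error forces you to."
    )

prompt = build_execution_prompt(
    task="Add a retry wrapper around the RPC client",
    files=["src/net/rpc_client.cpp", "src/net/rpc_client.hpp"],
    approach="Wrap call() in an exponential-backoff loop",
)
print(prompt)
```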
5. Avoid “Poisonous” System‑Level Prompts
Bad example
“Use `uv` for all Python commands.”
The model kept this rule in mind after every step, even when we weren’t in a Python repo anymore.
Recommendation
- Avoid broad system‑level prompts.
- Refine prompts or `AGENTS.md` per repository instead.
Real‑world incident
I was working in an old repo with a deprecated tool that printed an upgrade warning on every build. The model assumed the warning was the root cause and started installing the latest update, breaking the environment. I solved it by wrapping the function to hide the warning.
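A wrapper along these lines keeps the warning out of the agent's context entirely. The warning text and command are hypothetical stand-ins for the actual deprecated tool:

```python
import subprocess

DEPRECATION_MARKER = "is deprecated, please upgrade"  # hypothetical warning text

def filter_warnings(output: str, marker: str = DEPRECATION_MARKER) -> str:
    """Drop the nagging upgrade warning so the agent never sees it."""
    return "\n".join(line for line in output.splitlines() if marker not in line)

def run_build_quietly(cmd: list[str]) -> str:
    """Run the legacy tool and return its output minus the deprecation noise."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return filter_warnings(result.stdout + result.stderr)

print(filter_warnings("tool v1.2 is deprecated, please upgrade\nBuild OK"))
```

With the warning filtered out, the model has no stray signal to latch onto as a false root cause.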
6. Hierarchical Agent Collaboration
Think of this as sub‑agents on a higher level:
- One manager agent works with me on the big picture and helps craft prompts for a worker agent.
- The manager knows the high‑level goal; the worker handles the low‑level execution.
Typical workflow
- Open two terminals:
- Manager Agent – stays alive, knows the overall objective.
- Worker Agent – gets reset often when we hit context limits or go down a wrong path.
- Collaborate with the manager to decide which prompt to give the worker.
- When the worker gets stuck, the manager helps identify missing context or a higher‑level approach.
“This is a highly iterative process.”
I believe the current batch of models is good enough to automate this fully: a manager agent could programmatically break down and distribute tasks. I haven’t tested it with the latest models yet—Sonnet 4 wasn’t sufficient, but Opus 4.1 was close.
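Sketched programmatically, the loop might look like this. `ask_agent` stands in for whatever API or CLI you use to talk to a model, and the worker's token budget is an arbitrary assumption; the point is the reset-instead-of-compact behavior:

```python
from typing import Callable

def run_hierarchy(goal: str,
                  ask_agent: Callable[[str, str], str],
                  steps: list[str],
                  worker_budget: int = 50_000) -> list[str]:
    """Manager keeps the big picture; workers get reset when context grows."""
    results = []
    worker_context = 0
    for step in steps:
        # Manager turns the high-level goal + step into a precise worker prompt.
        prompt = ask_agent("manager", f"Goal: {goal}. Write a prompt for: {step}")
        if worker_context + len(prompt) > worker_budget:
            worker_context = 0  # start a fresh worker instead of compacting
        worker_context += len(prompt)
        results.append(ask_agent("worker", prompt))
    return results

# Stub model call so the orchestration logic can run standalone.
def fake_agent(role: str, prompt: str) -> str:
    return f"[{role}] handled: {prompt[:40]}"

out = run_hierarchy("ship feature X", fake_agent,
                    ["write tests", "implement", "refactor"])
print(len(out))
```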
7. Self‑Verification Is Essential
A key difference between agents and chatbots is that agents must be able to check their own work.
- Provide a clear path for the agent to build and test the code.
- Write tests before implementation (either yourself or a different agent).
- Tests act as tight feedback loops.
- They also serve as excellent documentation, far more concise than a block of English description.
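As a toy example of the test-first flow, the test pins down the contract in a few lines before any implementation exists. The `slugify` function here is purely illustrative, not something from a real codebase:

```python
# Written first: the test *is* the spec the agent must satisfy.
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  spaces   everywhere ") == "spaces-everywhere"

# Implemented second, iterated on until the test above passes.
import re

def slugify(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace runs into hyphens."""
    text = re.sub(r"[^a-z0-9\s-]", "", text.lower())
    return re.sub(r"[\s-]+", "-", text).strip("-")

test_slugify()
print("tests pass")
```

An agent can loop on the second half until the first half goes green, which is exactly the tight feedback loop described above.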
8. TL;DR – The Combo That Makes a Difference
| Aspect | Why It Matters | How to Handle It |
|---|---|---|
| Context size | Tokens are expensive and can poison results | Keep prompts minimal; use retrieval; split tasks |
| Pricing | High token usage → high cost | Use sub‑agents; limit context |
| Task overload | Too many instructions confuse the model | Be explicit, break down tasks |
| Retrieval | Models can’t hold massive token windows | Use external retrieval systems, keep context lean |
| Agent hierarchy | Fresh context = better answers | Manager ↔ Worker agents pattern |
| Self‑verification | Prevent junk output | Require build & test steps, generate tests first |
By treating context as a premium resource, splitting work among specialized agents, and forcing self‑verification, you can get reliable, cost‑effective results from today’s cutting‑edge LLMs.
Quality of answers
Anthropic recently put out a post‑mortem where they admitted that bugs in their infrastructure made their models dumber. So, if one day something works and the next it doesn’t, it is probably not your fault.
If someone releases a new update, you should probably try it for yourself to see which combination of mode, tool, or task really excels.
Keep your context close and your sub‑agents closer. The best you can do is keep your context as small as possible, and don’t be afraid to fully restart if the current path isn’t great.
I keep thinking that some of this stuff won’t matter with the next era of models, but every time they ship I find myself tracking more things. I’ve certainly been displaced from some lines of work by agents, but we are a long way away from them building cool stuff at scale.