Using Code Agents Effectively
Source: Dev.to
Overview
Every time LLMs get an “upgrade” they feel like magic – the first impressions are always strong.
That’s how I felt with code agents: suddenly the models were hyper‑context‑aware to my exact system and could iterate on problems until they produced a working solution.
The craziest part? I didn’t even have to understand the solution because the agent had already implemented it.
But, as it always goes, playing with this new capability quickly reveals hard limits. Beyond toy projects the agents begin to cut major corners, and when you try to edit a large repo they often fail completely.
The latest models (GPT‑5‑Codex, Sonnet 4.5) are well into the territory of being as capable as a new engineer. However, they need to be directed with extreme nuance and can’t just be let loose on problems like an eager new grad. The models are smart enough to get just about anything done, but not without a lot of teamwork.
Below I detail the strategies I’m using to get good results with the current batch of cutting‑edge LLMs and tools.
1. Context Is Worth More Than Gold
Even with large token windows, every token is poison that can ruin your results. The problem comes from three major angles:
Price
- With current Anthropic pricing a single prompt can cost over $30.
Task‑overload
- The model is thinking about its system prompt, every `# TODO` it finds, all of your instructions, etc.
- All of these compound to confuse the model.
Retrieval
- Every model handles token limits differently; in reality none of them truly hold hundreds of thousands of tokens in their “head”.
- Retrieval is probably the most important area of advancement for LLMs right now and is far from a solved problem.
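To get a feel for how quickly tokens become dollars, here is a back-of-the-envelope estimate. The per-million-token prices below are illustrative assumptions, not current Anthropic pricing; a single turn over a bloated context already costs a few dollars, and an agent session is many turns:

```python
def prompt_cost_usd(input_tokens: int, output_tokens: int,
                    usd_per_m_input: float = 15.0,
                    usd_per_m_output: float = 75.0) -> float:
    """Rough cost of one agent turn at assumed per-million-token prices."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

# A bloated context: 150k input tokens plus 10k generated tokens.
print(round(prompt_cost_usd(150_000, 10_000), 2))  # one turn, not one session
```

Multiply that by a few dozen turns in a long agentic session and the $30 prompts stop looking surprising.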
2. Agent‑of‑Agents: A “Crazy Hack” That Works
Having your primary agent call other agents gives you three big benefits:
- Higher‑quality answers – a fresh agent tackles a sub‑task with a clean context.
- Smaller main‑agent context – keeps the primary agent’s token window lean.
- Cost savings – smaller prompts = cheaper calls.
Typical workflow for code‑pairing
Two tasks happen a lot when doing code pairing at work with an agent:
- Running build systems
- Reading proprietary docs via MCP
Both require only a small amount of information, yet the tools dump massive token payloads:
- Average docs page ≈ 10 k tokens
- A small `make` build can be thousands of tokens
Solution: Create a fresh agent and tell it, e.g.:
“Run `make build`. I edited `xxx.hpp`. Let me know if there are any errors.”
The agent then returns concise feedback: either “Your change worked” or just the relevant error messages.
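The sub-agent's job here is essentially compression: turn thousands of tokens of build output into a few lines the main agent can act on. A minimal sketch of that filtering step, assuming a generic heuristic for what counts as an error line (not any specific tool's format):

```python
import re

def summarize_build(raw_output: str, max_errors: int = 5) -> str:
    """Condense a noisy build log into concise feedback for the main agent."""
    errors = [line.strip() for line in raw_output.splitlines()
              if re.search(r"\b(error|failed)\b", line, re.IGNORECASE)]
    if not errors:
        return "Your change worked: build completed with no errors."
    shown = "\n".join(errors[:max_errors])
    return f"Build failed with {len(errors)} error line(s):\n{shown}"

log = """\
g++ -c xxx.cpp
xxx.hpp:42:10: error: 'Foo' was not declared in this scope
make: *** [Makefile:12: xxx.o] Error 1
"""
print(summarize_build(log))
```

The main agent only ever sees the few returned lines, not the full log.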
3. Model‑Specific Strengths
- Claude Code v2.0 is particularly good at this pattern.
- I used to manage the cycle myself and start new chats frequently, but now I can keep the same chat going.
“It’s more vibes than strict rules, but I tend to compact after a new feature is implemented and tests are passing – pretty much the same milestones as when I would make a code commit.”
4. Working in Large, Proprietary Repos
- You must be extremely explicit with tasking, yet you can’t overload the model with too much info.
- Dumping all docs and folders into context lobotomizes the model.
Practical tips
- Tell the agent exactly where to look and which path to take for implementation.
- If you don’t know those answers, use a few different agents to gather the relevant info and help you craft a precise prompt for the execution agent.
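One way to think about that last step: the scout agents' findings get distilled into an explicit, minimal prompt for the execution agent. A sketch of that assembly, where the task, file paths, and approach are hypothetical examples of what the scouts might report:

```python
def build_execution_prompt(task: str, files: list[str], approach: str) -> str:
    """Assemble an explicit, minimal prompt from what scout agents found.

    Keeping the prompt to just these facts avoids flooding the execution
    agent's context with the rest of the repo.
    """
    file_list = "\n".join(f"- {path}" for path in files)
    return (
        f"Task: {task}\n"
        f"Only touch these files:\n{file_list}\n"
        f"Implementation approach: {approach}\n"
        "Do not read other parts of the repo unless a build error forces you to."
    )

prompt = build_execution_prompt(
    task="Add a retry wrapper around the RPC client",
    files=["src/net/rpc_client.cpp", "src/net/rpc_client.hpp"],
    approach="Wrap call() in an exponential-backoff loop",
)
print(prompt)
```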
5. Avoid “Poisonous” System‑Level Prompts
Bad example
“Use `uv` for all Python commands.”
The model kept this rule in mind after every step, even when we weren’t in a Python repo anymore.
Recommendation
- Avoid broad system‑level prompts.
- Refine prompts or `AGENTS.md` per repository instead.
Real‑world incident
I was working in an old repo with a deprecated tool that printed an upgrade warning on every build. The model assumed the warning was the root cause and started installing the latest update, breaking the environment. I solved it by wrapping the function to hide the warning.
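A wrapper along these lines keeps the warning out of the agent's context entirely. The warning text and command are hypothetical stand-ins for the actual deprecated tool:

```python
import subprocess

DEPRECATION_MARKER = "is deprecated, please upgrade"  # hypothetical warning text

def filter_warnings(output: str, marker: str = DEPRECATION_MARKER) -> str:
    """Drop the nagging upgrade warning so the agent never sees it."""
    return "\n".join(line for line in output.splitlines() if marker not in line)

def run_build_quietly(cmd: list[str]) -> str:
    """Run the legacy tool and return its output minus the deprecation noise."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return filter_warnings(result.stdout + result.stderr)

print(filter_warnings("tool v1.2 is deprecated, please upgrade\nBuild OK"))
```

With the warning filtered out, the model has no stray signal to latch onto as a false root cause.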
6. Hierarchical Agent Collaboration
Think of this as sub‑agents on a higher level:
- One manager agent works with me on the big picture and helps craft prompts for a worker agent.
- The manager knows the high‑level goal; the worker handles the low‑level execution.
Typical workflow
- Open two terminals:
- Manager Agent – stays alive, knows the overall objective.
- Worker Agent – gets reset often when we hit context limits or go down a wrong path.
- Collaborate with the manager to decide which prompt to give the worker.
- When the worker gets stuck, the manager helps identify missing context or a higher‑level approach.
“This is a highly iterative process.”
I believe the current batch of models is good enough to automate this fully: a manager agent could programmatically break down and distribute tasks. I haven’t tested it with the latest models yet—Sonnet 4 wasn’t sufficient, but Opus 4.1 was close.
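Sketched programmatically, the loop might look like this. `ask_agent` stands in for whatever API or CLI you use to talk to a model, and the worker's token budget is an arbitrary assumption; the point is the reset-instead-of-compact behavior:

```python
from typing import Callable

def run_hierarchy(goal: str,
                  ask_agent: Callable[[str, str], str],
                  steps: list[str],
                  worker_budget: int = 50_000) -> list[str]:
    """Manager keeps the big picture; workers get reset when context grows."""
    results = []
    worker_context = 0
    for step in steps:
        # Manager turns the high-level goal + step into a precise worker prompt.
        prompt = ask_agent("manager", f"Goal: {goal}. Write a prompt for: {step}")
        if worker_context + len(prompt) > worker_budget:
            worker_context = 0  # start a fresh worker instead of compacting
        worker_context += len(prompt)
        results.append(ask_agent("worker", prompt))
    return results

# Stub model call so the orchestration logic can run standalone.
def fake_agent(role: str, prompt: str) -> str:
    return f"[{role}] handled: {prompt[:40]}"

out = run_hierarchy("ship feature X", fake_agent,
                    ["write tests", "implement", "refactor"])
print(len(out))
```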
7. Self‑Verification Is Essential
A key difference between agents and chatbots is that agents must be able to check their own work.
- Provide a clear path for the agent to build and test the code.
- Write tests before implementation (either yourself or a different agent).
- Tests act as tight feedback loops.
- They also serve as excellent documentation, far more concise than a block of English description.
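As a toy example of the test-first flow, the test pins down the contract in a few lines before any implementation exists. The `slugify` function here is purely illustrative, not something from a real codebase:

```python
# Written first: the test *is* the spec the agent must satisfy.
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  spaces   everywhere ") == "spaces-everywhere"

# Implemented second, iterated on until the test above passes.
import re

def slugify(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace runs into hyphens."""
    text = re.sub(r"[^a-z0-9\s-]", "", text.lower())
    return re.sub(r"[\s-]+", "-", text).strip("-")

test_slugify()
print("tests pass")
```

An agent can loop on the second half until the first half goes green, which is exactly the tight feedback loop described above.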
8. TL;DR – The Combo That Makes a Difference
| Aspect | Why It Matters | How to Handle It |
|---|---|---|
| Context size | Tokens are expensive and can poison results | Keep prompts minimal; use retrieval; split tasks |
| Pricing | High token usage → high cost | Use sub‑agents; limit context |
| Task overload | Too many instructions confuse the model | Be explicit, break down tasks |
| Retrieval | Models can’t hold massive token windows | Use external retrieval systems, keep context lean |
| Agent hierarchy | Fresh context = better answers | Manager ↔ Worker agents pattern |
| Self‑verification | Prevent junk output | Require build & test steps, generate tests first |
By treating context as a premium resource, splitting work among specialized agents, and forcing self‑verification, you can get reliable, cost‑effective results from today’s cutting‑edge LLMs.
Quality of answers
Anthropic recently put out a post‑mortem where they admitted that bugs in their infrastructure made their models dumber. So, if one day something works and the next it doesn’t, it is probably not your fault.
If someone releases a new update, you should probably try it for yourself to see which combination of mode, tool, or task really excels.
Keep your context close and your sub‑agents closer. The best you can do is keep your context as small as possible, and don’t be afraid to fully restart if the current path isn’t great.
I keep thinking that some of this stuff won’t matter with the next era of models, but every time they ship I find myself tracking more things. I’ve certainly been displaced from some lines of work by agents, but we are a long way away from them building cool stuff at scale.