Your AI agent remembers what sounds related, not what worked

Published: 19 hours ago (June 13, 2026 at 07:19 PM EDT)

6 min read

Source: Dev.to

I spent a couple of weeks asking people a pretty basic question. If you are actually running agents, past the demo, in something resembling production, how do you handle memory? I was expecting a handful of tips. What I got instead was the same frustration over and over, and a problem that, as far as I can tell, nobody has cleanly solved yet. So I am writing it down, because if you build with agents you are going to run straight into it.

The thing everyone starts with

Most agent memory works the same way. Embed everything the agent has seen, store the vectors, and when a new task shows up, pull back whatever is closest and drop it into context. That is fine right up until it isn’t.

The catch is that “closest in vector space” really means “sounds related,” and sounding related is not the same as having worked last time. So the agent recalls the thing that resembles the task in front of it, not the thing that actually helped. It will cheerfully head down a path it already failed three sessions ago, because nothing ever told it that path was a dead end.

If you have watched an agent repeat its own mistake with total confidence, that is the whole bug right there. It is not stupid. It just never found out how the last attempt turned out.

What people are actually doing about it

Here is the part I did not expect. Almost everyone I talked to had already hit this and quietly built their own fix. And the fixes were all over the place, which to me is the tell that there is no standard answer yet. A few that kept coming up.

Some people just use files. No memory platform, nothing fancy. Working memory lives in plain files the agent reads on startup, the agent decides what to write, and old stuff rolls off into a vector store later. For one person working alone this was apparently rock solid, and they were a little smug about it, fairly.

Other people keep a separate failure log.
Pull “this failed and here is why” out of the general memory entirely, and when the agent wonders whether it has tried something before, check that log first, ahead of the normal similarity search. Somebody put it in a way that stuck with me. Embeddings are great at recalling topics. They almost never hold on to “we went down this road and it blew up because of X.”

A few have the agent write its own little post-mortem after each task. Tried this, it broke because of that, next time do the other thing. Then search those before starting fresh. The honest downside they admitted is that after thirty or forty of these the file turns into noise, so they had to bolt on a step that summarizes the old ones.

And some split memory into tiers. Stable facts the agent is allowed to trust, versus everything else, which it can mention but not act on unless it can point to where it came from.

Different shapes, same underlying instinct. Stop pretending every memory is equally trustworthy.

Where it all falls apart

Once I lined these up next to each other, one thing jumped out. Every single approach handles what to write down. None of them really handles what to keep.

Noticing that something failed turns out to be the easy half. You can catch tool errors, failed tests, timeouts, a change that got reverted. You can even treat “the task just ended and nobody ever confirmed it worked” as its own kind of failure, which is how you catch the quiet ones that never throw an error.

It is everything after that which gets hard. Which failures are worth keeping, and which were flukes. When a lesson stops being true because the system moved underneath it. How you stop a memory from sliding from “this happened once” into “this is the rule,” when nobody actually checked that it should be a rule.

One person framed it in a way I keep coming back to. A memory should hold proof, not a moral. The raw event, what happened and the evidence for it, should stay put and stay checkable.
The lesson you draw from it should be allowed to change when something later contradicts it. The moment those two things become a single object, the system starts defending its interpretation instead of just remembering what actually happened. Which, honestly, is a very human way to be wrong.

What the newer tools still skip

There is a fresh wave of memory tooling now that handles a nearby but different problem, which is tracking whether a stored fact is still true as time passes. Who owned this before, who owns it now. That is genuinely useful and a real step up from blind similarity.

But notice it is answering a different question. “Is this fact still current” is not the same as “did acting on this memory actually lead somewhere good.” A fact can be perfectly up to date and still be the exact thing that sent the agent into the wall three times in a row. Whether something is still true and whether it ever worked are two different axes. Most of the field is busy on the first one.

If you are building this today

The practical stuff I took away, mostly secondhand from people deeper in it than me.

Do not lean on similarity on its own. It hands you what looks related, not what helped.

Treat failures as real memory, because what did not work is often more useful than what is merely similar.

Keep the event and the lesson separate, so you can record what happened plainly and still revise the conclusion later.

Put a real gate in front of what gets promoted into a durable rule, because noticing a break is not the same as having learned the right thing, and bad lessons calcify fast.

And assume you will have to go back. A lesson that was true two weeks ago can be actively harmful once you have refactored the thing it was about.

None of this is solved. The people doing it well are using sensible rules of thumb: recency, prove it twice, a human glance, the occasional cleanup pass. And every one of those rules breaks somewhere predictable.
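
To make the event and lesson split a bit more concrete, here is a minimal sketch in Python of how the pieces might fit together. It is an illustration under stated assumptions, not anyone's actual system. The names (Event, Lesson, Memory, check_before_trying), the threshold of two supporting failures standing in for "prove it twice," and the exact-match lookup on the action string are all hypothetical. The sample tasks and error messages are made up.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Immutable record of what happened, plus the evidence for it."""
    task: str        # what the agent was trying to do
    action: str      # what it actually tried
    succeeded: bool  # observed outcome, not an interpretation
    evidence: str    # free text here, e.g. a failing test name or an error


@dataclass
class Lesson:
    """Revisable conclusion that points at events instead of absorbing them."""
    claim: str
    supporting: list = field(default_factory=list)     # indexes into the event log
    contradicting: list = field(default_factory=list)  # indexes into the event log

    def status(self, min_support: int = 2) -> str:
        # Hypothetical gate: one incident is an anecdote, two make a rule,
        # and any later contradiction puts the lesson back under review.
        if self.contradicting:
            return "disputed"
        if len(self.supporting) >= min_support:
            return "rule"
        return "anecdote"


class Memory:
    def __init__(self) -> None:
        self.events = []   # append-only log of raw events
        self.lessons = {}  # action -> Lesson (exact match is a simplification)

    def record(self, event: Event) -> None:
        self.events.append(event)
        index = len(self.events) - 1
        lesson = self.lessons.get(event.action)
        if not event.succeeded:
            if lesson is None:
                lesson = Lesson(claim=f"avoid '{event.action}'")
                self.lessons[event.action] = lesson
            lesson.supporting.append(index)
        elif lesson is not None:
            # The action worked this time. The old failures stay in the log;
            # only the conclusion drawn from them is called into question.
            lesson.contradicting.append(index)

    def check_before_trying(self, action: str) -> Optional[Lesson]:
        """Consult failure-derived lessons first, ahead of any similarity search."""
        return self.lessons.get(action)


memory = Memory()
memory.record(Event("deploy api", "restart without migrating", False, "test_schema failed"))
print(memory.check_before_trying("restart without migrating").status())  # anecdote
memory.record(Event("deploy worker", "restart without migrating", False, "KeyError: user_id"))
print(memory.check_before_trying("restart without migrating").status())  # rule
memory.record(Event("deploy api", "restart without migrating", True, "tests passed after refactor"))
print(memory.check_before_trying("restart without migrating").status())  # disputed
print(len(memory.events))  # 3
```

The point of the shape is that the three events never change, while the status of the lesson built on them moves from anecdote to rule to disputed. A real agent would presumably need a fuzzier lookup than an exact string match, and a disputed lesson would more likely go to a cleanup pass or a human glance than be silently dropped.
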
I do not think a better embedding model is the way out. The question feels different to me. Less “what is most similar to this,” and not even “what is still true,” but something closer to “what actually worked, and how do we hang onto that while the rest quietly fades.”

If you are running agents in production and wrestling with this, I would genuinely like to hear how you handle it. The conversation that kicked all of this off taught me more than anything I have read on the topic.
