I tuned Hindsight for long conversations

Published: March 23, 2026 at 01:34 PM EDT
Source: Dev.to

Job‑Sense AI

Last night I wondered if an agent could learn what not to recommend in a job‑matching pipeline; by morning, ours was blacklisting patterns it had only seen fail once—and getting better every run.

Overview

What I built isn’t another resume‑to‑job similarity tool. It’s a loop:

  1. Ingest resumes and job descriptions.
  2. Generate matches.
  3. Evaluate those matches.
  4. Feed the failures back into the system so it stops making the same mistake twice.
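
The four steps above can be sketched as a tiny loop. This is only an illustration under stated assumptions: `FeedbackLoop`, the similarity scores, and the `flag_r1` evaluator are hypothetical stand-ins, and exact-pair blacklisting is far cruder than the semantic retrieval described later in this post.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackLoop:
    """Toy version of the ingest -> match -> evaluate -> feed-back loop."""

    # (job_id, resume_id) pairs that were judged bad in an earlier run.
    failures: set = field(default_factory=set)

    def match(self, job_id, scored_resumes):
        """Rank resumes by score, skipping pairs that already failed."""
        candidates = [
            (resume_id, score)
            for resume_id, score in scored_resumes.items()
            if (job_id, resume_id) not in self.failures
        ]
        return sorted(candidates, key=lambda pair: pair[1], reverse=True)

    def evaluate(self, job_id, ranked, is_bad):
        """Record every recommendation the evaluator flags as bad."""
        for resume_id, _score in ranked:
            if is_bad(job_id, resume_id):
                self.failures.add((job_id, resume_id))


def flag_r1(job_id, resume_id):
    """Stand-in evaluator (assumption): pretend a reviewer rejected r1."""
    return resume_id == "r1"


loop = FeedbackLoop()
scores = {"r1": 0.91, "r2": 0.78}  # hypothetical similarity scores

first_run = loop.match("backend-42", scores)
loop.evaluate("backend-42", first_run, flag_r1)
second_run = loop.match("backend-42", scores)  # r1 no longer appears
```

The second run no longer recommends the pair that failed in the first, which is the smallest possible version of "stop making the same mistake twice."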

Repository layout

| Directory | Purpose |
| --- | --- |
| main.py | Wires the pipeline together |
| matcher/ | Handles embedding + similarity scoring |
| agent/ | Wraps the LLM logic (ranking, reasoning, critique) |
| memory/ | Stores “Hindsight” data |
| evaluation/ | Defines what a “bad recommendation” actually means |

The interesting part isn’t matching – it’s what happens after a bad match.


The thing I got wrong about “agent memory”

My initial assumption (embarrassingly common) was:

memory = more context

def rank_candidates(job, resumes):
    context = build_context(job, resumes)
    return llm.generate_rankings(context)

It worked for obvious matches but collapsed on edge cases:

  • Overweighting keyword overlap (“Python” everywhere)
  • Ignoring disqualifiers buried in experience
  • Recommending over‑qualified or irrelevant candidates
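
To make the keyword-overlap problem concrete, here is a small sketch that uses Jaccard similarity over word sets as a stand-in for the real scorer. The function name and the example texts are assumptions for illustration, not the project's matcher.

```python
def keyword_overlap(job_text: str, resume_text: str) -> float:
    """Jaccard similarity over lowercase word sets."""
    job_words = set(job_text.lower().split())
    resume_words = set(resume_text.lower().split())
    if not job_words or not resume_words:
        return 0.0
    return len(job_words & resume_words) / len(job_words | resume_words)


job = "python backend engineer five years experience"
junior = "python backend engineer one year experience"
senior = "python backend developer eight years postgres experience"

junior_score = keyword_overlap(job, junior)
senior_score = keyword_overlap(job, senior)

# The junior resume shares more surface keywords with the job text,
# so it outscores the senior one despite missing the experience bar.
print(junior_score > senior_score)
```

A learned embedding is less brittle than raw word overlap, but the same failure shape shows up: surface similarity wins unless something tells the ranker what disqualifies a candidate.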

I first tried “improve prompts” and “add more examples.” That helped a bit, but the same class of mistake kept returning.

What I actually needed was not better context but persistent negative feedback.


Turning mistakes into data (with Hindsight)

The shift happened when I integrated Hindsight (GitHub). Instead of trying to prevent bad outputs upfront, I let the system fail and then recorded why it failed.

Recording a failure

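from datetime import datetime, timezone

def record_failure(job_id, resume_id, reason):
    # `memory` is the project's memory/ module; each record is one failed match.
    memory.store({
        "type": "negative_match",
        "job_id": job_id,
        "resume_id": resume_id,
        "reason": reason,  # short, human-readable lesson
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })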

These aren’t just logs. They’re indexed and later retrieved during ranking.

If you haven’t seen it, the Hindsight documentation explains the retrieval model well—but the real insight is what you choose to store.

What I store

I don’t store full conversations. I store compressed lessons:

  • “Rejected: frontend‑heavy profile for backend‑only role”
  • “Mismatch: required 5+ years, candidate has 1.5”
  • “False positive due to keyword overlap (React vs React Native backend tooling)”

That compression step matters more than anything else.
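
One way the compression step could look is a plain template over structured evaluation fields. This is a sketch under assumptions: `FailedMatch` and `compress_lesson` are hypothetical names, and the post does not say whether the real lessons come from a template, from the LLM, or are written by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailedMatch:
    """Hypothetical structured output of the evaluation step."""

    job_title: str
    required_years: float
    candidate_years: float


def compress_lesson(failure: FailedMatch) -> str:
    """Reduce a failed match to a one-line lesson worth storing."""
    return (
        f"Mismatch: {failure.job_title} required "
        f"{failure.required_years:g}+ years, "
        f"candidate has {failure.candidate_years:g}"
    )


lesson = compress_lesson(FailedMatch("backend role", 5, 1.5))
print(lesson)
```

Whatever produces the lesson, the point is that the stored string is short and carries the reason, not the transcript that led to it.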


Injecting “don’t do this again” into the agent

Once failures are stored, the next step is using them during ranking.

def rank_with_memory(job, resumes):
    past_failures = memory.retrieve(
        query=job.description,
        filter={"type": "negative_match"},
        top_k=5
    )
    context = build_context(job, resumes, past_failures)
    return llm.generate_rankings(context)

Key point: past_failures are semantically retrieved – I’m not just filtering by job ID; I’m asking:

“What past mistakes look similar to this job?”

This is where a vector‑based memory layer shines. You’re not building a static DB; you’re building a memory system that can generalize.
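
The idea of retrieving by similarity rather than by job ID can be sketched without any external service. The bag-of-words `embed` below is a deliberately crude stand-in (an assumption) for a learned embedding model, and `retrieve` is not Hindsight's API; it only shows the shape of a top-k lookup.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use a learned model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, lessons: list[str], top_k: int = 5) -> list[str]:
    """Return the top_k stored lessons most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        lessons,
        key=lambda lesson: cosine(query_vec, embed(lesson)),
        reverse=True,
    )
    return ranked[:top_k]


lessons = [
    "rejected frontend heavy profile for backend only role",
    "mismatch required five years candidate has one",
    "false positive due to keyword overlap react native",
]
top = retrieve("backend only role for payments team", lessons, top_k=1)
print(top)
```

The query here is a job the system has never seen, yet the lesson about backend-only roles still surfaces, which is the generalization that an ID-keyed lookup cannot give you.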


What surprised me: the agent started arguing with itself

When failures were injected into the context, the LLM began pre‑emptively rejecting candidates before ranking them. I made that explicit with a critique step:

def critique_candidate(job, resume, failures):
    return llm.generate({
        "job": job,
        "resume": resume,
        "past_failures": failures,
        "task": "Should this candidate be rejected? Why?"
    })

Updated pipeline

  1. Generate candidate scores.
  2. Run critique step.
  3. Adjust ranking or drop candidates.

This turned the agent into a two‑pass system:

  • Pass 1: “Who looks good?”
  • Pass 2: “Why might this be wrong?”

The second pass delivered most of the improvements.
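
The two-pass shape can be sketched with plain functions in place of the LLM calls. Everything named here (`two_pass_rank`, `toy_score`, `toy_critique`) is a hypothetical stand-in, and the substring check is only a placeholder for a real critique prompt.

```python
from typing import Callable, Optional


def two_pass_rank(
    resumes: dict[str, str],
    score: Callable[[str], float],
    critique: Callable[[str], Optional[str]],
) -> list[tuple[str, float]]:
    """Pass 1 scores every resume; pass 2 drops those the critic rejects."""
    # Pass 1: "Who looks good?"
    scored = {resume_id: score(text) for resume_id, text in resumes.items()}
    # Pass 2: "Why might this be wrong?" A returned reason means reject.
    kept = [
        (resume_id, value)
        for resume_id, value in scored.items()
        if critique(resumes[resume_id]) is None
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Stand-ins for the LLM calls, so the sketch runs without a model.
past_failures = ["frontend"]


def toy_score(text: str) -> float:
    return float(text.split().count("python"))


def toy_critique(text: str) -> Optional[str]:
    for lesson in past_failures:
        if lesson in text:
            return f"resembles past failure: {lesson}"
    return None


resumes = {
    "r1": "python python frontend react",
    "r2": "python backend postgres",
}
ranked = two_pass_rank(resumes, toy_score, toy_critique)
print(ranked)
```

Note that `r1` wins the first pass on keyword count and is removed only because the critique step has a past failure to compare against.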


The blacklist isn’t static—and that’s the whole point

I briefly considered a rule engine:

If experience < […]

- **[…] > model quality**  
  - A slightly worse model with good failure retrieval outperformed a better model with none.  

- **Two‑pass systems are underrated**  
  - `Generate → critique` is dramatically more stable than single‑pass ranking.  

- **Don’t build rules when you can build feedback loops**  
  - Static rules age poorly. Feedback systems adapt.  

---

## If I were to rebuild this  

I’d double down on the memory layer earlier, specifically:  

- Better schema for failure types  
- Explicit clustering of similar mistakes  
- Decay or pruning of outdated memories  

Right now it works—but it’s still naive.  
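
For the decay and pruning item, one simple option is an exponential half-life on each stored failure. This is a sketch of an approach not yet in the project; the 30-day half-life, the 0.1 cutoff, and the function names are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def decayed_weight(
    stored_at: datetime, now: datetime, half_life_days: float = 30.0
) -> float:
    """Weight halves every half_life_days; a fresh memory has weight 1.0."""
    age_days = (now - stored_at).total_seconds() / 86400
    return 0.5 ** (age_days / half_life_days)


def prune(
    memories: list[dict], now: datetime, min_weight: float = 0.1
) -> list[dict]:
    """Keep only memories whose decayed weight is at least min_weight."""
    return [
        memory
        for memory in memories
        if decayed_weight(memory["timestamp"], now) >= min_weight
    ]


now = datetime(2026, 3, 23, tzinfo=timezone.utc)
memories = [
    {"reason": "recent mismatch", "timestamp": now - timedelta(days=30)},
    {"reason": "stale mismatch", "timestamp": now - timedelta(days=200)},
]
kept = prune(memories, now)
print([memory["reason"] for memory in kept])
```

The same weight could also be multiplied into the retrieval score instead of being used as a hard cutoff, so old lessons fade rather than vanish.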

---

## Closing thought  

If you’re building anything that **ranks, recommends, or decides**—don’t just ask:

> “How do I make it better?”

Ask:

> “How do I make it **remember being wrong**?”

That one shift changed this project from a brittle matcher into something that actually improves over time.  

If you want to go deeper into how this kind of memory layer works, check out:

- The **Hindsight** GitHub repository  
- The **Hindsight** documentation  
- The agent memory approach from **Vectorize**  

The implementation details matter—but the bigger idea is simple:

> **Don’t just build systems that predict.  
> Build systems that regret—and remember why.**