Implementing Claude Code's Memory Model as a Dreaming Layer on 58 Articles

Published: (June 10, 2026 at 04:03 PM EDT)
9 min read
Source: Dev.to


Author: [shinji shimizu](https://dev.to/shinji_shimizu_bb51276a5e)

In a single session I built a pipeline that consolidates the 58 tech-blog articles of my service Kotonia (ja/en/zh) into a semantic index, then uses that index to detect duplicates when mining ideas for new articles. The full path (raw articles → semantic index → TF-IDF dedup → chunked draft generation) runs on a local Gemma 4 26B driven by Codex CLI. Design and implementation notes follow.

The motivation, and the framing of how a solo developer's accumulated assets compound, is in the companion piece: The Day a Solo Developer’s Accumulated Assets Finally Started to Compound

This piece keeps the technical notes.

  1. The Problem — When Title-Only Dedup Broke

Mining v1 produced a draft and I (the user) noticed “this overlaps with an existing article.” The overlap target was voice-first-local-llm (importance=9 flagship).

  • New draft thesis: “tokens per chunk is a hidden voice-chat latency driver”

  • Existing article §3.3: ”★ Streaming granularity — the structural difference that decides voice experience”

Same numbers (Local Gemma 1.0 tok/chunk, Haiku 10-16, Gemini 8-24). A perfect duplicate.

The mining agent had called art-done-list (title + description) for the dedup check. But the existing article’s title is “Cutting short-form LLM latency from 600ms to 22ms,” with TTFB as the headline sales pitch; §3.3 streaming granularity is buried in an H2 subsection. At title level, nothing overlapped, so the check came back clean.

That’s the starting point for this article.

  2. The Design: Three Layers (episodic ↔ semantic ↔ procedural)

Breaking down why Claude Code’s memory system works:

  • Entries are small (1-3KB, one topic each) → subtopics don’t get buried

  • Hooks are retrieval-tuned and curated → search terms re-appear in the hook

  • A smart model writes hooks semi-autonomously → past me distills for future me

Articles have the opposite shape: each is 5-15KB, important subtopics are buried in subsection bodies, descriptions are SEO summaries rather than retrieval-tuned hooks, and the whole article is too heavy for an agent to load.

I bridged them with an intermediate layer named the Dreaming layer, after the biological metaphor of memory consolidation during sleep (hippocampus to cortex).

episodic (raw articles + memory files)
    ↓ Dreaming agent (periodic digestion)
semantic (concepts_covered_ja[] / importance / data_points / sections)
    ↓ agent reverse-lookup (art-concepts-find / TF-IDF cosine)
procedural (mining / drafting / publishing)

A semantic entry for an article looks like:

{
  "slug": "voice-first-local-llm",
  "locale": "ja",
  "thesis_ja": "Ditching API, building voice-first with self-hosted local 26B",
  "importance": {
    "score": 9,
    "factors": {
      "pv_count_30d": 6,
      "avg_scroll": 67.0,
      "avg_dwell_sec": 170,
      "has_bench_data": true,
      "novelty_high": true
    }
  },
  "concepts_covered_ja": [
    "TTFB (time-to-first-byte): local vs API",
    "Streaming granularity (tokens per chunk)",
    "Gemma 4 26B model selection rationale",
    "Ditto + LLM co-residency GPU design"
  ],
  "data_points": [
    {"name": "TTFB Local", "value": "17-25ms"},
    {"name": "Streaming granularity Local", "value": "1.0 tok/chunk"}
  ],
  "sections": [
    {"id": "3.3", "title": "Streaming granularity — the structural difference that decides voice experience"}
  ]
}

The key point: concepts_covered_ja[] must be normalized to Japanese canonical names. Translated EN/ZH articles use the same JP concept strings. That single normalization becomes the dedup primitive downstream.
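
As a rough illustration of why the shared canonical strings matter, here is a minimal sketch of a concept → article reverse lookup. It is not the actual art-concepts-find: `build_concept_index` and `find_articles` are hypothetical names, and exact string matching is an assumption.

```python
from collections import defaultdict


def build_concept_index(entries):
    """Map each canonical concept string to the (slug, locale) pairs covering it."""
    index = defaultdict(set)
    for entry in entries:
        for concept in entry.get("concepts_covered_ja", []):
            index[concept].add((entry["slug"], entry["locale"]))
    return index


def find_articles(index, concept):
    """Exact-match reverse lookup (an assumption); sorted for stable output."""
    return sorted(index.get(concept, set()))


# Two locales of the same article share one canonical concept string,
# so a single lookup surfaces both.
entries = [
    {"slug": "voice-first-local-llm", "locale": "ja",
     "concepts_covered_ja": ["Streaming granularity (tokens per chunk)"]},
    {"slug": "voice-first-local-llm", "locale": "en",
     "concepts_covered_ja": ["Streaming granularity (tokens per chunk)"]},
]
index = build_concept_index(entries)
hits = find_articles(index, "Streaming granularity (tokens per chunk)")
print(hits)
```

Because every locale uses the same concept string, the lookup needs no translation step. If each locale used its own wording, the index would split into three unrelated keys.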

  3. Tools: Thin CLIs the Agent Calls

Codex CLI drives Gemma 4 26B locally. Tool calling via --enable-auto-tool-choice --tool-call-parser gemma4 gives an OpenAI-compatible surface. Each tool is ~50-100 lines of Python (stdlib only), art- prefix:

| tool | role |
| --- | --- |
| art-articles-list --needs-dreaming | DB ∪ FS articles + dreaming state |
| art-pv-count --slug X | analytics_events → PV / scroll / dwell |
| art-source-pull [--section N] | pull just one H2/H3 section of an article |
| art-dream-write | upsert a semantic entry into articles_index.jsonl |
| art-concepts-find | concept → article reverse-lookup (the mining dedup primitive) |
| art-ideas-check | evaluate a candidate idea via TF-IDF (the core of this article) |
| art-ideas-add | push an idea to the pool (calls art-ideas-check internally) |
| art-draft-append | append a chunk of draft body to a buffer |
| art-draft-commit | finalize buffer → articles/_drafts/.md |

The Dreaming agent semantically encodes one article at a time using these. Importance scoring uses this rubric:

+2: PV >= 100 (sigmoid log-scale)
+1: avg_scroll >= 0.7 AND avg_dwell_sec >= 60
+2: bench numbers / failure root cause / named decision
+2: novel concept not yet in index
+1: evergreen value (not time-sensitive)
-2: redundant with an already-indexed flagship

PV comes from a homegrown analytics_events table (cookie-less first-party tracker). The fact that the article platform and analytics co-reside in one DB you can hit directly is a solo-dev win.
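
The rubric can be read as a sum of additive terms. The sketch below is one possible reading, with several assumptions: the PV term is a hard threshold (the post mentions a "sigmoid log-scale" but does not specify it), `avg_scroll` is a fraction between 0 and 1 (the example entry's 67.0 suggests the real unit may be percent), `has_bench_data` stands in for the whole "bench numbers / failure root cause / named decision" line, and the keys `is_evergreen` and `redundant_with_flagship` are hypothetical. The starting score is not stated in the post, so only the delta is computed.

```python
def rubric_delta(f):
    """Sum the rubric's additive terms for one article's observations."""
    delta = 0
    if f.get("pv_count_30d", 0) >= 100:  # simplification: hard threshold
        delta += 2
    if f.get("avg_scroll", 0.0) >= 0.7 and f.get("avg_dwell_sec", 0) >= 60:
        delta += 1
    if f.get("has_bench_data", False):
        delta += 2
    if f.get("novelty_high", False):
        delta += 2
    if f.get("is_evergreen", False):  # hypothetical key
        delta += 1
    if f.get("redundant_with_flagship", False):  # hypothetical key
        delta -= 2
    return delta


full = {
    "pv_count_30d": 150, "avg_scroll": 0.8, "avg_dwell_sec": 90,
    "has_bench_data": True, "novelty_high": True, "is_evergreen": True,
}
print(rubric_delta(full))
```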

  4. TF-IDF Dedup: Substituting Tool Structure for Agent Self-Discipline

At mining v1 the prompt instructed the agent to call art-concepts-find for dedup. The agent slipped through three duplicates anyway (details: Don’t Trust an Agent’s Self-Discipline).

The fix: embed a dedup gate directly inside art-ideas-add. The guts of evaluate_idea():

def evaluate_idea(title, angle, sources, ...):
    articles, ideas = load_corpus()
    # infer the candidate's concepts from the canonical vocab
    pseudo = {"concepts": _infer_concepts(title, angle, sources, articles)}

    # IDF (rare concepts weighted more)
    idf = build_idf(articles + ideas)
    new_vec = vectorize(pseudo["concepts"], idf)

    conflicts = []
    for a in articles:
        sim = cosine(new_vec, vectorize(a["concepts"], idf))
        if a["importance_score"] >= 7 and sim >= 0.25:
            conflicts.append({"kind": "flagship_concept", ...})
    for i in ideas:
        sim = cosine(new_vec, vectorize(i["concepts"], idf))
        if sim >= 0.35:
            conflicts.append({"kind": "pool_dup", ...})

    return {"allow": not conflicts, "conflicts": conflicts}
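
The excerpt above elides its helpers, so here is a self-contained sketch of the same idea for readers who want something runnable. It is an approximation under stated assumptions, not the author's code: the smoothed IDF formula, the weight given to concepts not yet in the corpus, and the `add_idea` wrapper are all choices made for illustration. Only the thresholds (importance >= 7, 0.25, 0.35) come from the excerpt.

```python
import math
from collections import Counter


def build_idf(docs):
    """Smoothed IDF over concept sets: rarer concepts get larger weights."""
    n = len(docs)
    df = Counter(c for d in docs for c in set(d["concepts"]))
    return {c: math.log((1 + n) / (1 + k)) + 1.0 for c, k in df.items()}


def vectorize(concepts, idf):
    """Binary presence weighted by IDF; unseen concepts get the max weight (assumption)."""
    unseen = max(idf.values(), default=1.0)
    return {c: idf.get(c, unseen) for c in set(concepts)}


def cosine(u, v):
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def evaluate(concepts, articles, ideas):
    idf = build_idf(articles + ideas)
    new_vec = vectorize(concepts, idf)
    conflicts = []
    for a in articles:
        sim = cosine(new_vec, vectorize(a["concepts"], idf))
        if a["importance_score"] >= 7 and sim >= 0.25:
            conflicts.append({"kind": "flagship_concept", "slug": a["slug"]})
    for i in ideas:
        if cosine(new_vec, vectorize(i["concepts"], idf)) >= 0.35:
            conflicts.append({"kind": "pool_dup", "title": i["title"]})
    return {"allow": not conflicts, "conflicts": conflicts}


def add_idea(pool, title, concepts, articles):
    """The gate sits in the write path: a rejected idea never reaches the pool."""
    verdict = evaluate(concepts, articles, pool)
    if verdict["allow"]:
        pool.append({"title": title, "concepts": list(concepts)})
    return verdict


articles = [
    {"slug": "voice-first-local-llm", "importance_score": 9,
     "concepts": ["TTFB", "Streaming granularity", "Model selection", "GPU co-residency"]},
    {"slug": "other", "importance_score": 4, "concepts": ["Stripe checkout"]},
]
pool = []
rejected = add_idea(pool, "tokens per chunk", ["Streaming granularity", "Voice latency"], articles)
accepted = add_idea(pool, "webhook retries", ["Webhook retries"], articles)
print(rejected["allow"], accepted["allow"], len(pool))
```

The design point is that the agent cannot skip the check: the only way into the pool is through `add_idea`, which runs the gate itself.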

Three traps along the way in _infer_concepts():

Trap 1: substring-match false positives

The ASCII term “check” matches inside “checkout”; “PRO” inside “prod_”. The Stripe idea was falsely matched into “品質チェック (quality check/retry)” or “Blackwell Max-Q (RTX PRO 6000)” and rejected.

Fix: ASCII terms require word boundary; JP terms can stay substring.
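
A minimal sketch of that rule, not the author's exact code. The regex standing in for `_ASCII_RE` and the explicit ASCII lookarounds are assumptions. Case-insensitive matching is inferred from the "PRO" / "prod_" collision described above.

```python
import re

# Hypothetical stand-in for the post's _ASCII_RE: terms made only of ASCII characters.
_ASCII_RE = re.compile(r"^[\x00-\x7f]+$")


def term_matches(term: str, text: str) -> bool:
    """ASCII terms must not touch ASCII letters, digits, or underscores; others match as substrings."""
    if _ASCII_RE.match(term):
        pattern = r"(?<![A-Za-z0-9_])" + re.escape(term) + r"(?![A-Za-z0-9_])"
        return re.search(pattern, text, flags=re.IGNORECASE) is not None
    return term in text


print(term_matches("check", "Stripe checkout flow"))
print(term_matches("PRO", "RTX PRO 6000"))
```

Explicit lookarounds are used instead of `\b` because Python's `\b` treats Japanese characters as word characters, so an ASCII term written directly next to Japanese text would have no boundary and would fail to match.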

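def _term_matches(term: str, text: str) -> bool:
    if _ASCII_RE.match(term):
        pattern = r"(?…

[The rest of this function, Trap 2, and the opening of Trap 3 are missing from this copy. The surviving text resumes mid-sentence, on the similarity threshold.]

… >= 0.30, but a binary vector with 4 concepts and 1 shared concept maxes out around cosine 0.25. Even with IDF weighting, 0.27-0.30 was the borderline. I dropped the threshold to 0.25 and instead tightened the precision of the substring matcher (the false-positive engine).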
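
The 0.25 ceiling follows from the cosine formula for 0/1 vectors, which reduces to shared / sqrt(|A| · |B|). A quick check, assuming both the candidate and the article carry 4 concepts (the text does not say which side has 4):

```python
import math


def binary_cosine(shared: int, size_a: int, size_b: int) -> float:
    """Cosine similarity of two 0/1 concept vectors: shared / sqrt(|A| * |B|)."""
    return shared / math.sqrt(size_a * size_b)


print(binary_cosine(1, 4, 4))  # 0.25
```

So under this assumption a single shared concept can never clear a 0.30 threshold before IDF weighting is applied, which matches the borderline behaviour described above.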

Regression test: 4/4 across the known 4 cases (OpenWeight NSFW / streaming-granularity / CodeFormer / Stripe).

  
  
  5. Small-Model Specific Traps — Codex CLI + 26B Uncensored

Driving local 26B (Gemma 4 26B A4B Uncensored MAX) through Codex CLI, I observed 4 failure modes and their fixes:

**Trap 4: descriptive prompt → "I will begin by surveying..." then exit**

The first mining run had the agent summarize "what I'll do next" and exit with zero tool calls. Fix:

Critical: do not narrate, plan, or describe what you will do. Just call tools. The first action must be shell({"command": "art-..."}) — start there.



With an imperative prompt and an explicit first action, the agent starts moving.

**Trap 5: huge tool output triggers a generation loop**

`art-commits-recent --since "60 days ago" --include-files` returned ~1300 lines of JSON including bodies; the agent then emitted ~25K tokens of output continuously, never stopping. Fix: `art-commits-recent` defaults to subject-only; body via `--include-body` opt-in.
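
The general pattern is that a tool's default output should be small, with anything large behind an explicit opt-in. A minimal sketch, assuming a hypothetical `render_commits` helper and a hypothetical `max_items` cap (neither is from the actual tool):

```python
import json


def render_commits(commits, include_body=False, max_items=50):
    """Serialize commits for the agent; bodies are dropped unless requested."""
    rows = []
    for c in commits[:max_items]:
        row = {"sha": c["sha"][:8], "subject": c["subject"]}
        if include_body:
            row["body"] = c.get("body", "")
        rows.append(row)
    return json.dumps(rows, ensure_ascii=False)


commits = [{"sha": "a" * 40, "subject": "fix dedup gate", "body": "long text\n" * 100}]
short = render_commits(commits)
full = render_commits(commits, include_body=True)
print(len(short), len(full))
```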

**Trap 6: 5KB+ heredoc in tool_call.arguments JSON breaks the escape**

Sending `art-draft-save  <<'EOF' ... 5KB body ... EOF` as a single shell tool_call reliably breaks 26B's string escaping inside the arguments JSON (`Unterminated string at column 5083`).

Fix: split into chunked append + commit. ~200-800 chars per chunk, 4-8 appends, final commit:

    art-draft-append my-slug <<'KOTONIA_EOF'
    title: "…"
    KOTONIA_EOF

    art-draft-append my-slug <<'KOTONIA_EOF'
    1. First section
    …
    KOTONIA_EOF

    …repeat per section…

    art-draft-commit my-slug



Each tool_call's arguments JSON stays small, and the escaping failure disappears.
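
To make the chunk size concrete, here is a minimal sketch of splitting a draft body at paragraph boundaries so that each piece fits in one append call. `split_into_chunks` is a hypothetical helper. In the pipeline described here the agent itself decides where to cut, so this only illustrates the size constraint.

```python
def split_into_chunks(body: str, max_chars: int = 800):
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is emitted as its own oversized
    chunk rather than being cut mid-sentence.
    """
    chunks, current = [], ""
    for para in body.split("\n\n"):
        candidate = para if not current else current + "\n\n" + para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


body = "\n\n".join("x" * 300 for _ in range(5))
chunks = split_into_chunks(body)
print([len(c) for c in chunks])
```

Each resulting chunk would become the heredoc body of one `art-draft-append` call, followed by a single `art-draft-commit`.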

**Trap 7: Codex exec self-terminates after ~4 articles**

There seems to be an implicit constraint where one `codex exec` invocation finishes with a summary message after ~25K tokens / ~4 articles. Codex's Goals feature (`thread_goals.objective`) could prevent that, but you can't set it via `exec` (only the interactive TUI as of v0.133).

Fix: wrap `dispatcher.sh` in an external loop. Restart `codex exec` until `pending == 0`.

    max_cycles=30
    cycle=0
    while (( cycle < max_cycles )); do
      pending=$(art-articles-list --needs-dreaming --count-only)
      if (( pending == 0 )); then break; fi
      run_codex dream
      cycle=$((cycle + 1))
    done



That gets 58 articles digested in 2-3 cycles.
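
The same restart-until-drained logic, sketched in Python with the two commands injected as callables so the loop can be exercised without Codex. `run_until_drained` is a hypothetical name, and the 25-articles-per-cycle figure in the stand-in is arbitrary, not a measurement.

```python
def run_until_drained(count_pending, run_cycle, max_cycles=30):
    """Re-run the agent until nothing is pending or the cycle cap is hit.

    Returns the number of cycles executed.
    """
    cycles = 0
    while cycles < max_cycles and count_pending() > 0:
        run_cycle()
        cycles += 1
    return cycles


# Stand-ins for the real commands (arbitrary numbers, for illustration only).
state = {"pending": 58}


def fake_count():
    return state["pending"]


def fake_run():
    state["pending"] = max(0, state["pending"] - 25)


cycles = run_until_drained(fake_count, fake_run)
print(cycles, state["pending"])
```

The cycle cap matters as much as the pending check: if a run makes no progress, the loop still terminates instead of restarting the agent forever.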

  
  
  6. What Landed

The working pipeline:

- 58 articles → semantic index, importance bell-shaped (median 5-6), flagship recognition correct (voice-first-local-llm at score 9 across all locales)

- 70 memory files mined for unexplored concepts, 4 ideas land in the pool as survivors

- 4 drafts generated, ~3.6-4.6KB each, publish-ready after 10-20 minutes of human polish

- TF-IDF dedup gate at the tool layer blocks any agent self-discipline violation

Repo: [github coming soon]

  
  
  7. Generalization

The structure — **raw assets → semantic compression → agent reverse-lookup** — generalizes beyond articles:

**Test generation**: semantically compress existing tests, mine uncovered branches, draft new tests

**PR descriptions**: semantically compress the codebase delta, dedupe against unrelated PRs, draft a description

**Support FAQs**: semantically compress past support tickets, surface uncovered topics, draft new FAQs

**Personal knowledge base**: Scrapbox / Notion accumulation → semantic compression → mechanically discover unexplored concepts

Common design principles:

**Raw assets are heavy**. Don't load them directly — insert a consolidation layer.

**The canonical vocabulary is the semantic-layer primitive**. Without normalization, dedup doesn't work.

**Enforcement belongs at the tool layer**. Agent self-discipline is unstable; bake the rule into the structure.

Seeing this pattern opened up applications in other parts of Kotonia (persona generation in character chat, TTS prompt accumulation, etc.).

  
  
  Aside: Development Time

One session (~6h). Dreaming layer design → 5 new tools → Codex prompts → first-time consolidation → TF-IDF dedup → chunked draft → 4 article drafts generated, all in one stretch.

Local 26B as the "runs on electricity only" agent absorbed the grinding labor; the human only had to make judgment calls and steering corrections. Doing this on frontier APIs would have cost $50-100.

[Kotonia](https://kotonia.ai/) is a voice-first AI character chat platform. The drafts revived by this pipeline live on the same blog if you're curious.