Part 2: Why Transformers Still Forget
Why Long‑Context Language Models Still Struggle with Memory (second of a three‑part series)
In Part 1 we saw why simply increasing context length does not solve the memory problem.
Here we introduce a memory‑centric way of thinking that explains why models remember, forget, or fail under long context.
Why Architectural Labels Stop Being Useful
Most discussions about sequence models revolve around architectural families—Transformers, RNNs, state‑space models, linear attention, etc. While these labels are historically useful, they often hide the real reasons models behave the way they do.
- Two models with very different architectures can fail for the same reason.
- Two seemingly similar models can behave very differently under long context.
The MIRAS perspective starts from a simple shift: instead of asking “what architecture is this?” we ask “what kind of memory system is this model implementing?” Once you adopt that lens, many long‑context failures stop looking mysterious and start looking inevitable.
Memory as a System, Not a Side Effect
At a high level, any system that processes sequences over time must answer four questions—explicitly or implicitly:
- How does information get written into memory?
- How is information retrieved later?
- What gets forgotten, and when?
- How is memory updated as new data arrives?
Traditional models answer these questions indirectly.
- Recurrent models write by compressing history into a hidden state and read by exposing that state at the next step.
- Transformers write by appending tokens into the context and read by attending over them.
- Forgetting happens automatically when context limits are exceeded or when compression loses detail.
MIRAS makes these mechanisms explicit and treats them as design choices, not side effects.
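To make these four questions concrete, the sketch below writes them down as an explicit interface. The interface and method names are my own illustration, not code from the MIRAS paper; any sequence model can be read as an implicit implementation of it.

```python
from typing import Protocol

import torch


class MemorySystem(Protocol):
    """Illustrative interface: the four questions every sequence model answers,
    whether or not it exposes them explicitly (names are assumptions, not a MIRAS API)."""

    def write(self, key: torch.Tensor, value: torch.Tensor) -> None:
        """How does information get written into memory?"""
        ...

    def read(self, query: torch.Tensor) -> torch.Tensor:
        """How is information retrieved later?"""
        ...

    def forget(self) -> None:
        """What gets forgotten, and when?"""
        ...

    def step(self, key: torch.Tensor, value: torch.Tensor) -> None:
        """How is memory updated as new data arrives (typically write plus forget)?"""
        ...
```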
The Four MIRAS Design Knobs
MIRAS (Memory‑Informed Recurrent Associative Systems) characterizes sequence models using four core components. They are not tied to any single architecture; they describe how memory behaves.
| Design knob | What it defines |
|---|---|
| Memory structure | The form memory takes (vector, matrix, richer neural network, etc.). Fixed‑size structures force compression; richer structures allow selective retention. |
| Attentional bias | What the model considers relevant. In Transformers this is typically dot‑product similarity. The choice strongly influences what gets retrieved and what gets ignored, especially in noisy or long sequences. |
| Retention / forgetting mechanism | Whether forgetting is controlled and adaptive, or implicit and uncontrolled. Forgetting is a necessity, not a flaw. |
| Memory update rule | How memory changes over time. Some models update memory only during training; others allow controlled updates during inference. |

Figure: the four MIRAS dimensions (memory structure, attentional bias, retention, and update rule).
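One way to internalize the table is to treat the four knobs as fields in an explicit configuration rather than an architecture label. The dataclass below is a rough sketch; the field names and category values are my own shorthand, not terminology from the MIRAS paper.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class MemoryDesign:
    """Hypothetical record of the four MIRAS knobs as explicit design choices."""
    structure: Literal["kv_cache", "vector", "matrix", "mlp"]       # form memory takes
    attentional_bias: Literal["dot_product", "l2", "robust"]        # what counts as relevant
    retention: Literal["window_eviction", "decay", "learned_gate"]  # how forgetting happens
    update_rule: Literal["train_time_only", "test_time_update"]     # when memory can change


# Filling in the form for two familiar cases (discussed in the next section):
vanilla_transformer = MemoryDesign("kv_cache", "dot_product", "window_eviction", "train_time_only")
linear_attention = MemoryDesign("matrix", "dot_product", "decay", "train_time_only")
```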
Reinterpreting Familiar Models Through MIRAS
Viewing common architectures through the MIRAS lens clarifies their strengths and weaknesses.
- Transformers
  - Memory structure: the full context window (a key–value cache that grows with every token).
  - Attentional bias: similarity‑based (dot‑product) attention.
  - Retention: crude; once the window is full, older information disappears entirely.
  - Update rule: static during inference.
- Linear‑attention and state‑space models
  - Modify the memory structure and update rules for efficiency, but rely on aggressive compression into a fixed‑size state.
  - This explains why they scale well yet struggle with precise recall over very long sequences (see the sketch below).
The key insight: these trade‑offs are not accidental; they follow directly from the memory‑design choices each model makes.
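The contrast is easier to see in code. The toy sketch below compares the two memory designs; class names, dimensions, and the decay constant are illustrative assumptions, not code from any real model.

```python
import torch


class KVCacheMemory:
    """Transformer-style memory: append every (key, value) pair.
    Exact recall while the window lasts; memory grows with sequence length."""

    def __init__(self, max_len: int):
        self.max_len = max_len
        self.keys: list[torch.Tensor] = []
        self.values: list[torch.Tensor] = []

    def write(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.max_len:  # crude retention:
            self.keys.pop(0)               # the oldest entry vanishes entirely
            self.values.pop(0)

    def read(self, q: torch.Tensor) -> torch.Tensor:
        K, V = torch.stack(self.keys), torch.stack(self.values)
        attn = torch.softmax(K @ q / q.shape[-1] ** 0.5, dim=0)  # similarity-based bias
        return attn @ V


class LinearStateMemory:
    """Linear-attention / state-space-style memory: one fixed-size matrix.
    Constant cost per step, but every write compresses history into the same state."""

    def __init__(self, d: int, decay: float = 0.99):
        self.state = torch.zeros(d, d)
        self.decay = decay  # implicit, uncontrolled forgetting

    def write(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.state = self.decay * self.state + torch.outer(k, v)

    def read(self, q: torch.Tensor) -> torch.Tensor:
        return q @ self.state  # lossy, compressed recall


# Both see the same stream; only the KV cache can reproduce early tokens exactly.
torch.manual_seed(0)
kv, lin = KVCacheMemory(max_len=4096), LinearStateMemory(d=16)
for _ in range(100):
    k, v = torch.randn(16), torch.randn(16)
    kv.write(k, v)
    lin.write(k, v)
```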
Why Loss Functions and Objectives Matter
A subtle but important point in MIRAS is that memory behaviour is influenced not only by architecture but also by the objective being optimized.
- Many models rely heavily on mean‑squared‑error‑like objectives or similarity‑based losses.
- Such losses are sensitive to noise and outliers, which in turn affect which memory updates are emphasized.
MIRAS uses this observation to motivate alternative formulations that change how relevance and stability are defined. The result is not just better robustness, but more predictable memory behaviour under long and noisy inputs.
Takeaway: memory is not just where information is stored; it is also shaped by the learning signals that decide what is kept.
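As a toy illustration of that point (the numbers are made up, and this is not MIRAS's actual objective), compare how a single outlier dominates the gradient that would drive a memory update under an MSE‑style loss versus a robust alternative such as the Huber loss:

```python
import torch
import torch.nn.functional as F

pred = torch.zeros(5, requires_grad=True)
target = torch.tensor([0.1, -0.2, 0.0, 0.1, 8.0])  # last entry is an outlier

mse_grad = torch.autograd.grad(F.mse_loss(pred, target), pred)[0]
huber_grad = torch.autograd.grad(F.huber_loss(pred, target, delta=1.0), pred)[0]

print(mse_grad)    # the outlier's gradient is ~40x larger than any other entry's
print(huber_grad)  # the robust loss caps the outlier's influence on the update
```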
Why This Framework Matters Before Talking About Titans
Without a framework like MIRAS, “Titans” (test‑time updates, surprise signals, adaptive forgetting, etc.) can look like a collection of clever tricks. With MIRAS, those choices become legible—they are answers to explicit memory‑design questions rather than ad‑hoc optimisations.
Part 1 showed that attention alone cannot serve as long‑term memory. Part 2 explains why most existing alternatives still fall short. Only after this framing does it make sense to examine Titans as a concrete instantiation of a different memory system.
What to Watch for in Real Apps
- Memory structure: Does the system expose a rich enough representation for the task?
- Attentional bias: Are similarity metrics appropriate for the data distribution?
- Retention policy: Is forgetting controlled, or does it happen unintentionally when the context overflows?
- Update rule: Can the model adapt its memory during inference, or is it frozen after training?
Keeping these four knobs in mind will help you diagnose and improve long‑context performance in production settings.
Implications
If you apply the MIRAS lens to real systems, patterns emerge quickly. Models fail when the memory structure is too rigid, when retention is uncontrolled, or when update rules are frozen despite changing inputs. Conversely, systems become more robust when memory design is intentional and aligned with task requirements.
This perspective is especially relevant for agents, streaming data, long‑running processes, and any application where the model must operate continuously rather than in isolated prompts.
Looking Ahead to Part 3
Part 2 sets the conceptual groundwork. In Part 3, we will look closely at the Titans architecture and see how it instantiates these memory principles in practice: how long‑term memory is represented, how it is updated during inference, and how forgetting is managed to keep the system stable.