Part 2: Why Transformers Still Forget
Why Long‑Context Language Models Still Struggle with Memory (second of a three‑part series)
In Part 1 we saw why simply increasing context length does not solve the memory problem.
Here we introduce a memory‑centric way of thinking that explains why models remember, forget, or fail under long context.
Why Architectural Labels Stop Being Useful
Most discussions about sequence models revolve around architectural families—Transformers, RNNs, state‑space models, linear attention, etc. While these labels are historically useful, they often hide the real reasons models behave the way they do.
- Two models with very different architectures can fail for the same reason.
- Two seemingly similar models can behave very differently under long context.
The MIRAS perspective starts from a simple shift: instead of asking “what architecture is this?” we ask “what kind of memory system is this model implementing?” Once you adopt that lens, many long‑context failures stop looking mysterious and start looking inevitable.
Memory as a System, Not a Side Effect
At a high level, any system that processes sequences over time must answer four questions—explicitly or implicitly:
- How does information get written into memory?
- How is information retrieved later?
- What gets forgotten, and when?
- How is memory updated as new data arrives?
Traditional models answer these questions indirectly.
- Recurrent models write by compressing history into a hidden state and read by exposing that state at the next step.
- Transformers write by appending tokens into the context and read by attending over them.
- Forgetting happens automatically when context limits are exceeded or when compression loses detail.
MIRAS makes these mechanisms explicit and treats them as design choices, not side effects.
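To make these four questions concrete, the sketch below writes them down as an explicit interface. The interface and method names are my own illustration, not code from the MIRAS paper; any sequence model can be read as an implicit implementation of it.

```python
from typing import Protocol

import torch


class MemorySystem(Protocol):
    """Illustrative interface: the four questions every sequence model answers,
    whether or not it exposes them explicitly (names are assumptions, not a MIRAS API)."""

    def write(self, key: torch.Tensor, value: torch.Tensor) -> None:
        """How does information get written into memory?"""
        ...

    def read(self, query: torch.Tensor) -> torch.Tensor:
        """How is information retrieved later?"""
        ...

    def forget(self) -> None:
        """What gets forgotten, and when?"""
        ...

    def step(self, key: torch.Tensor, value: torch.Tensor) -> None:
        """How is memory updated as new data arrives (typically write plus forget)?"""
        ...
```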
The Four MIRAS Design Knobs
MIRAS (Memory‑Informed Recurrent Associative Systems) characterizes sequence models using four core components. They are not tied to any single architecture; they describe how memory behaves.
| Design knob | What it defines |
|---|---|
| Memory structure | The form memory takes (vector, matrix, richer neural network, etc.). Fixed‑size structures force compression; richer structures allow selective retention. |
| Attentional bias | What the model considers relevant. In Transformers this is typically dot‑product similarity. The choice strongly influences what gets retrieved and what gets ignored, especially in noisy or long sequences. |
| Retention / forgetting mechanism | Whether forgetting is controlled and adaptive, or implicit and uncontrolled. Forgetting is a necessity, not a flaw. |
| Memory update rule | How memory changes over time. Some models update memory only during training; others allow controlled updates during inference. |

Figure: the four MIRAS dimensions (memory structure, attentional bias, retention, and update rule).
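One way to internalize the table is to treat the four knobs as fields in an explicit configuration rather than an architecture label. The dataclass below is a rough sketch; the field names and category values are my own shorthand, not terminology from the MIRAS paper.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class MemoryDesign:
    """Hypothetical record of the four MIRAS knobs as explicit design choices."""
    structure: Literal["kv_cache", "vector", "matrix", "mlp"]       # form memory takes
    attentional_bias: Literal["dot_product", "l2", "robust"]        # what counts as relevant
    retention: Literal["window_eviction", "decay", "learned_gate"]  # how forgetting happens
    update_rule: Literal["train_time_only", "test_time_update"]     # when memory can change


# Filling in the form for two familiar cases (discussed in the next section):
vanilla_transformer = MemoryDesign("kv_cache", "dot_product", "window_eviction", "train_time_only")
linear_attention = MemoryDesign("matrix", "dot_product", "decay", "train_time_only")
```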
Reinterpreting Familiar Models Through MIRAS
Viewing common architectures through the MIRAS lens clarifies their strengths and weaknesses.
- Transformers
  - Memory structure: the full context window (a key–value cache that grows with every token).
  - Attentional bias: similarity‑based (dot‑product) attention.
  - Retention: crude; once the window is full, older information disappears entirely.
  - Update rule: static during inference.
- Linear‑attention and state‑space models
  - Modify the memory structure and update rules for efficiency, but rely on aggressive compression into a fixed‑size state.
  - This explains why they scale well yet struggle with precise recall over very long sequences (see the sketch below).
The key insight: these trade‑offs are not accidental; they follow directly from the memory‑design choices each model makes.
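The contrast is easier to see in code. The toy sketch below compares the two memory designs; class names, dimensions, and the decay constant are illustrative assumptions, not code from any real model.

```python
import torch


class KVCacheMemory:
    """Transformer-style memory: append every (key, value) pair.
    Exact recall while the window lasts; memory grows with sequence length."""

    def __init__(self, max_len: int):
        self.max_len = max_len
        self.keys: list[torch.Tensor] = []
        self.values: list[torch.Tensor] = []

    def write(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.max_len:  # crude retention:
            self.keys.pop(0)               # the oldest entry vanishes entirely
            self.values.pop(0)

    def read(self, q: torch.Tensor) -> torch.Tensor:
        K, V = torch.stack(self.keys), torch.stack(self.values)
        attn = torch.softmax(K @ q / q.shape[-1] ** 0.5, dim=0)  # similarity-based bias
        return attn @ V


class LinearStateMemory:
    """Linear-attention / state-space-style memory: one fixed-size matrix.
    Constant cost per step, but every write compresses history into the same state."""

    def __init__(self, d: int, decay: float = 0.99):
        self.state = torch.zeros(d, d)
        self.decay = decay  # implicit, uncontrolled forgetting

    def write(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.state = self.decay * self.state + torch.outer(k, v)

    def read(self, q: torch.Tensor) -> torch.Tensor:
        return q @ self.state  # lossy, compressed recall


# Both see the same stream; only the KV cache can reproduce early tokens exactly.
torch.manual_seed(0)
kv, lin = KVCacheMemory(max_len=4096), LinearStateMemory(d=16)
for _ in range(100):
    k, v = torch.randn(16), torch.randn(16)
    kv.write(k, v)
    lin.write(k, v)
```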
Why Loss Functions and Objectives Matter
A subtle but important point in MIRAS is that memory behaviour is influenced not only by architecture but also by the objective being optimized.
- Many models rely heavily on mean‑squared‑error‑like objectives or similarity‑based losses.
- Such losses are sensitive to noise and outliers, which in turn affect which memory updates are emphasized.
MIRAS uses this observation to motivate alternative formulations that change how relevance and stability are defined. The result is not just better robustness, but more predictable memory behaviour under long and noisy inputs.
Takeaway: memory is not just where information is stored; it is also shaped by the learning signals that decide what is kept.
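As a toy illustration of that point (the numbers are made up, and this is not MIRAS's actual objective), compare how a single outlier dominates the gradient that would drive a memory update under an MSE‑style loss versus a robust alternative such as the Huber loss:

```python
import torch
import torch.nn.functional as F

pred = torch.zeros(5, requires_grad=True)
target = torch.tensor([0.1, -0.2, 0.0, 0.1, 8.0])  # last entry is an outlier

mse_grad = torch.autograd.grad(F.mse_loss(pred, target), pred)[0]
huber_grad = torch.autograd.grad(F.huber_loss(pred, target, delta=1.0), pred)[0]

print(mse_grad)    # the outlier's gradient is ~40x larger than any other entry's
print(huber_grad)  # the robust loss caps the outlier's influence on the update
```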
Why This Framework Matters Before Talking About Titans
Without a framework like MIRAS, “Titans” (test‑time updates, surprise signals, adaptive forgetting, etc.) can look like a collection of clever tricks. With MIRAS, those choices become legible—they are answers to explicit memory‑design questions rather than ad‑hoc optimisations.
Part 1 showed that attention alone cannot serve as long‑term memory. Part 2 explains why most existing alternatives still fall short. Only after this framing does it make sense to examine Titans as a concrete instantiation of a different memory system.
What to Watch for in Real Apps
- Memory structure: Does the system expose a rich enough representation for the task?
- Attentional bias: Are similarity metrics appropriate for the data distribution?
- Retention policy: Is forgetting controlled, or does it happen unintentionally when the context overflows?
- Update rule: Can the model adapt its memory during inference, or is it frozen after training?
Keeping these four knobs in mind will help you diagnose and improve long‑context performance in production settings.
Implications
If you apply the MIRAS lens to real systems, patterns emerge quickly. Models fail when the memory structure is too rigid, when retention is uncontrolled, or when update rules are frozen despite changing inputs. Conversely, systems become more robust when memory design is intentional and aligned with task requirements.
This perspective is especially relevant for agents, streaming data, long‑running processes, and any application where the model must operate continuously rather than in isolated prompts.
Looking Ahead to Part 3
Part 2 sets the conceptual groundwork. In Part 3, we will look closely at the Titans architecture and see how it instantiates these memory principles in practice: how long‑term memory is represented, how it is updated during inference, and how forgetting is managed to keep the system stable.