Building an RLM with Mastra: Introducing mastra-rlm-kit
Source: Dev.to

## TL;DR
I just open‑sourced **mastra-rlm-kit**, a paper‑faithful implementation of *Recursive Language Models (RLMs)* for Mastra.
It lets agents:
- break complex tasks into executable Python steps
- spawn recursive and batched sub‑queries
- ground reasoning in code instead of vibes
- produce full, inspectable audit trails
This isn’t a prompt trick. It’s an architecture.
👉 GitHub:
👉 npm: `npm install mastra-rlm-kit`
---
## The Problem: Agents Are Still Bad at Thinking
If you’ve built agents with Mastra (or LangGraph, CrewAI, AutoGen…), you’ve probably tried something like:
> “Given earnings reports, analyst notes, and news articles that don’t fit in a single context window, analyze renewable energy stocks in Q3 2024, compare them to traditional energy, and give me a recommendation.”
**What happens?**
- the model silently drops context
- key documents are ignored
- comparisons are incomplete or superficial
- the final answer sounds confident but isn’t grounded
Not because the model is weak — but because **the agent architecture is**.
Most agents still assume:
- one prompt
- one context window
- one response
That breaks down immediately once the task exceeds context limits or requires verification.
---
## The Core Insight: Reasoning Needs Structure
The **Recursive Language Models (RLMs)** paper introduced a simple but powerful idea:
> Don’t ask the model to reason in one pass.
> Force it to reason *step by step*, with execution and recursion.
An RLM works like this:
1. A **root model** decomposes the task into steps.
2. Each step can execute **Python code**.
3. When more information is needed, it spawns **recursive sub‑queries**.
4. Sub‑queries can run **in parallel**.
5. Every action is logged and auditable.
Instead of hoping the model reasons correctly inside a single context window, you **externalize the reasoning process**.
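The five-step loop above can be sketched in miniature. Everything below is a toy illustration with hypothetical names (`rlm`, `decompose`, `llmQuery`), not the kit's actual API:

```typescript
// Toy sketch of the RLM control flow (hypothetical names, not the kit's API).
type Step = { question: string };

// Stand-in for a real model call at a given recursion depth.
async function llmQuery(question: string, depth: number): Promise<string> {
  return `answer(${question}@d${depth})`;
}

// Stand-in for the root model decomposing a task into sub-questions.
function decompose(task: string): Step[] {
  return [{ question: `${task}: part 1` }, { question: `${task}: part 2` }];
}

// Root loop: decompose, recurse within a depth budget, then synthesize.
async function rlm(task: string, depth = 0, maxDepth = 1): Promise<string> {
  if (depth >= maxDepth) return llmQuery(task, depth);
  const steps = decompose(task);
  const subAnswers = await Promise.all(
    steps.map((s) => rlm(s.question, depth + 1, maxDepth)),
  );
  return `synthesis[${subAnswers.join(" | ")}]`;
}
```

The depth budget is what keeps the recursion finite: once `maxDepth` is hit, a leaf call answers directly instead of decomposing further.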
---
## What `mastra-rlm-kit` Brings to Mastra
Mastra already has workflows, observability, and strong TypeScript ergonomics.
What it didn’t have was **serious reasoning**.
`mastra-rlm-kit` adds that missing layer with three main exports:
| Export | Purpose |
|--------|---------|
| `createRlmTool()` | Expose RLM as a callable tool |
| `createRlmWorkflow()` | Build full recursive reasoning pipelines |
| `createRlmRunner()` | Low‑level, programmatic control |
This isn’t a “conceptual” RLM — it’s **paper‑faithful** and production‑oriented.
---
## Key Features
- ✅ **Paper‑faithful RLM implementation**
- 🔁 **Recursive sub‑queries** via `llm_query()` and `llm_query_batched()`
- ⚡ Parallel exploration with batched calls
- 🧪 **Grounded reasoning** via sandboxed Python REPL
- 📜 **Deterministic artifacts**: output, events, audit log, recursion tree
- 🔌 **Model‑agnostic**: works with any Mastra‑compatible model
Every run leaves a trail you can inspect, debug, and trust.
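As an illustration of what those deterministic artifacts can look like in practice, here is a hypothetical shape for a run record. The field names are assumptions for the sake of the example, not the kit's actual schema (see the repository for that):

```typescript
// Hypothetical run record (illustrative only; not the kit's real schema).
interface RlmRunRecord {
  output: string;                                                  // final synthesized answer
  events: { ts: number; kind: string }[];                          // ordered execution events
  auditLog: string[];                                              // human-readable trail of every action
  recursionTree: { question: string; children: RlmRunRecord[] }[]; // sub-query structure
}

// A minimal example record of the kind you could assert against in tests.
const record: RlmRunRecord = {
  output: "BUY: renewables outperformed on Q3 margins",
  events: [{ ts: 1, kind: "repl_step" }, { ts: 2, kind: "llm_query" }],
  auditLog: ["step 1: loaded reports", "step 2: spawned 2 sub-queries"],
  recursionTree: [{ question: "summarize report A", children: [] }],
};
```

The point of a typed record like this is that "inspect, debug, and trust" becomes ordinary test code rather than log spelunking.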
---
## Quick Start
```bash
npm install mastra-rlm-kit @mastra/core zod
```

### Use It as a Tool

```ts
import { createRlmTool } from "mastra-rlm-kit";

export const runRlmTool = createRlmTool({
  workspace, // your configured workspace, defined elsewhere in your project
  defaults: {
    rootModelId: "openrouter/moonshotai/kimi-k2.5",
    subModelId: "openrouter/minimax/minimax-m2.5",
    budgets: {
      maxIterations: 30,
      maxCalls: 50,
      maxDepth: 1,
      maxOutputChars: 10_000,
    },
  },
});
```

### Or as a Workflow

```ts
import { createRlmWorkflow } from "mastra-rlm-kit";

export const rlmWorkflow = createRlmWorkflow({
  workspace,
  models: {
    root: { id: "openrouter/moonshotai/kimi-k2.5" },
    sub: { id: "openrouter/minimax/minimax-m2.5" },
  },
  defaults: {
    budgets: {
      maxIterations: 30,
      maxCalls: 50,
      maxDepth: 1,
      maxOutputChars: 10_000,
    },
  },
});
```
---
## Where RLMs Actually Shine
| Use Case | Why RLM Helps |
|---|---|
| Long‑context tasks | Break work across recursive calls instead of one window |
| Multi‑hop Q&A | Each hop is a traceable sub‑query |
| Math & logic | Python executes and verifies reasoning |
| Data analysis | Intermediate states are inspectable |
| Research synthesis | Parallel sub‑queries before synthesis |
If the task exceeds a single context window or requires verification, RLMs win.
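For the long-context row in particular, the pattern is easy to picture: chunk the corpus, answer per chunk with a sub-query, then synthesize. A self-contained sketch with a mocked sub-query (nothing here is the kit's API):

```typescript
// Mocked sub-query: in a real RLM this would be a recursive model call over one chunk.
async function subQuery(chunk: string): Promise<string> {
  return `facts(${chunk.length} chars)`;
}

// Split a corpus that exceeds one context window into fixed-size chunks.
function chunk(text: string, size: number): string[] {
  const parts: string[] = [];
  for (let i = 0; i < text.length; i += size) parts.push(text.slice(i, i + size));
  return parts;
}

// Fan out one sub-query per chunk in parallel, then synthesize the partial answers.
async function longContextAnswer(corpus: string, windowSize: number): Promise<string> {
  const partials = await Promise.all(chunk(corpus, windowSize).map(subQuery));
  return partials.join("; ");
}
```

No single call ever sees more than `windowSize` characters, which is the whole trick: the context limit constrains each leaf, not the task.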
---
## A Note on Benchmarks
`mastra-rlm-kit` includes strict, reproducible benchmarks — but they’re not the headline feature.
All benchmark runs:
- use datasets as‑is (no rewritten questions or labels)
- run the RLM loop without prompt tuning
- score outputs using official exact‑match metrics
### Current Results (OolongBench)
On a recent OolongBench validation slice:
- Accuracy: 20% (exact match)
- Completion rate: 100 %
- Avg. sub‑queries: ~8 per task
Many failures are near‑misses (off‑by‑one values, partial lists, non‑canonical names), which are not counted as correct by design.
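To make the scoring concrete: exact match is strict string equality (typically after trivial normalization), so a near-miss like an off-by-one value scores zero. A minimal scorer in that spirit, not the benchmark's official implementation:

```typescript
// Strict exact-match scoring: lowercase + trim, then plain string equality.
// Illustrative only; OolongBench's official metric code lives in the benchmark repo.
function exactMatch(prediction: string, gold: string): number {
  const norm = (s: string) => s.trim().toLowerCase();
  return norm(prediction) === norm(gold) ? 1 : 0;
}

// Accuracy over (prediction, gold) pairs.
function accuracy(pairs: [string, string][]): number {
  const correct = pairs.reduce((n, [p, g]) => n + exactMatch(p, g), 0);
  return correct / pairs.length;
}
```

Usage: `accuracy([["42", "42"], ["41", "42"], ["Berlin ", "berlin"], ["B", "Berlin"]])` returns 0.5: the off-by-one answer and the truncated name both count as wrong, exactly as described above.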
### Why This Is Still Useful
These results aren’t about leaderboard performance. They show that RLMs:
- execute multi‑step reasoning reliably
- fail deterministically (no silent hallucinations)
- produce full traces you can inspect and improve
Full benchmark commands and reports live in the repository.
---
## How It Works
(Further implementation details, architecture diagrams, and runtime flow are described in the repository README and documentation.)
Internally:
- **Root model receives the task**
- It writes **Python REPL steps**
- Steps execute and store intermediate results
- Missing info → spawn `llm_query()` sub‑queries
- Sub‑queries batch and parallelize
- Results aggregate into a final synthesis
- Full trace is persisted
Every claim is either
- **executed code**, or
- **traceable recursive output**
That’s how hallucinations die.
---
## Why Mastra Was the Right Fit
Mastra already gets the fundamentals right:
- TypeScript‑first
- Built‑in observability
- Clean workflow primitives
- Model‑agnostic via Vercel AI SDK
RLMs don’t replace Mastra — they **complete it**.
---
## Final Thought
The gap between *agents that talk* and *agents that think* is still massive.
Most demos fall apart the moment you ask for:
- long‑context reasoning
- verification
- decomposition
- accountability
`mastra-rlm-kit` doesn’t add magic.
It adds **structure, execution, and transparency**.
Try it. Break it. Improve it.
And tell me what you build.
— Built by **[@metasurfero](https://dev.to/metasurfero)**