Building an RLM with Mastra: Introducing mastra-rlm-kit

Published: February 16, 2026 at 06:46 AM EST
5 min read
Source: Dev.to

![Cover image for Building an RLM with Mastra: Introducing mastra-rlm-kit](https://media2.dev.to/dynamic/image/width=1000,height=420,fit=cover,gravity=auto,format=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftb41969yfpy8b6a7ai1q.png)

## TL;DR

I just open‑sourced **mastra-rlm-kit**, a paper‑faithful implementation of *Recursive Language Models (RLMs)* for Mastra.

It lets agents:

- break complex tasks into executable Python steps  
- spawn recursive and batched sub‑queries  
- ground reasoning in code instead of vibes  
- produce full, inspectable audit trails  

This isn’t a prompt trick. It’s an architecture.

👉 GitHub:   
👉 npm: `npm install mastra-rlm-kit`

---

## The Problem: Agents Are Still Bad at Thinking

If you’ve built agents with Mastra (or LangGraph, CrewAI, AutoGen…), you’ve probably tried something like:

> “Given earnings reports, analyst notes, and news articles that don’t fit in a single context window, analyze renewable energy stocks in Q3 2024, compare them to traditional energy, and give me a recommendation.”

**What happens?**

- the model silently drops context  
- key documents are ignored  
- comparisons are incomplete or superficial  
- the final answer sounds confident but isn’t grounded  

Not because the model is weak — but because **the agent architecture is**.

Most agents still assume:

- one prompt  
- one context window  
- one response  

That breaks down immediately once the task exceeds context limits or requires verification.

---

## The Core Insight: Reasoning Needs Structure

In 2024, Chen *et al.* introduced **Recursive Language Models (RLMs)** with a simple but powerful idea:

> Don’t ask the model to reason in one pass.  
> Force it to reason *step by step*, with execution and recursion.

An RLM works like this:

1. A **root model** decomposes the task into steps.  
2. Each step can execute **Python code**.  
3. When more information is needed, it spawns **recursive sub‑queries**.  
4. Sub‑queries can run **in parallel**.  
5. Every action is logged and auditable.

Instead of hoping the model reasons correctly inside a single context window, you **externalize the reasoning process**.
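
To make the shape of this concrete, here is a minimal sketch of that loop. This is not the kit's internals; `rootModel`, `runPython`, and `spawnSubQuery` are illustrative stand-ins for the real components:

```ts
// Illustrative sketch of the RLM loop above, not mastra-rlm-kit's internals.
interface RootModel {
  // Returns the next Python step to execute, or null when the task is done.
  nextStep(task: string, history: string[]): Promise<string | null>;
  // Produces the final answer from the accumulated trace.
  synthesize(task: string, history: string[]): Promise<string>;
}

declare const rootModel: RootModel;
declare function runPython(code: string): Promise<{ output: string; subQueries: string[] }>;
declare function spawnSubQuery(query: string): Promise<string>;

async function rlmLoop(task: string, maxIterations = 30): Promise<string> {
  const history: string[] = [];
  for (let i = 0; i < maxIterations; i++) {
    const step = await rootModel.nextStep(task, history); // 1. decompose into the next step
    if (step === null) break;
    const result = await runPython(step);                 // 2. execute in a sandboxed REPL
    const answers = await Promise.all(                    // 3–4. recursive sub-queries, run in parallel
      result.subQueries.map(spawnSubQuery),
    );
    history.push(step, result.output, ...answers);        // 5. everything lands in the audit trail
  }
  return rootModel.synthesize(task, history);
}
```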

---

## What `mastra-rlm-kit` Brings to Mastra

Mastra already has workflows, observability, and strong TypeScript ergonomics.  
What it didn’t have was **serious reasoning**.

`mastra-rlm-kit` adds that missing layer with three main exports:

| Export | Purpose |
|--------|---------|
| `createRlmTool()` | Expose RLM as a callable tool |
| `createRlmWorkflow()` | Build full recursive reasoning pipelines |
| `createRlmRunner()` | Low‑level, programmatic control |

This isn’t a “conceptual” RLM — it’s **paper‑faithful** and production‑oriented.
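
The first two are shown in the Quick Start below. For the runner, the exact signature lives in the repo docs; as a rough sketch, assuming it takes the same config shape as the tool and workflow and exposes a `run()` method (both of which are assumptions, not the documented API):

```ts
import { createRlmRunner } from "mastra-rlm-kit";

// Hypothetical usage only: `run()` and its return shape are assumptions.
// `workspace` is your configured workspace (see the repo docs).
declare const workspace: unknown;

const runner = createRlmRunner({
  workspace,
  models: {
    root: { id: "openrouter/moonshotai/kimi-k2.5" },
    sub: { id: "openrouter/minimax/minimax-m2.5" },
  },
});

const result = await runner.run({
  task: "Compare Q3 2024 renewable vs. traditional energy performance.",
});
console.log(result.output);   // final synthesis
console.log(result.auditLog); // full, inspectable trail
```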

---

## Key Features

- **Paper‑faithful RLM implementation**
- 🔁 **Recursive sub‑queries** via `llm_query()` and `llm_query_batched()`
  - ⚡ Parallel exploration with batched calls
- 🧪 **Grounded reasoning** via sandboxed Python REPL
- 📜 **Deterministic artifacts**: output, events, audit log, recursion tree
- 🔌 **Model‑agnostic**: works with any Mastra‑compatible model

Every run leaves a trail you can inspect, debug, and trust.
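
The exact artifact schema is defined by the kit, but as a mental model, the deterministic artifacts from a run look roughly like this (field names here are illustrative, not the real types):

```ts
// Illustrative shape only; see the repository for the actual artifact types.
interface RecursionNode {
  query: string;            // the (sub-)query that was asked
  depth: number;            // 0 = root, 1 = first level of llm_query() calls, ...
  children: RecursionNode[];
}

interface RlmRunArtifacts {
  output: string;                                             // final synthesized answer
  events: Array<{ type: string; at: number; data: unknown }>; // step-by-step event log
  auditLog: string[];                                         // human-readable trail of every action
  recursionTree: RecursionNode;                               // which queries spawned which
}
```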

---

## Quick Start

```bash
npm install mastra-rlm-kit @mastra/core zod
```

### Use It as a Tool

```ts
import { createRlmTool } from "mastra-rlm-kit";

export const runRlmTool = createRlmTool({
  workspace, // your configured workspace (see the repo docs)
  defaults: {
    rootModelId: "openrouter/moonshotai/kimi-k2.5",
    subModelId: "openrouter/minimax/minimax-m2.5",
    budgets: {
      maxIterations: 30,
      maxCalls: 50,
      maxDepth: 1,
      maxOutputChars: 10_000,
    },
  },
});
```

### Or as a Workflow

```ts
import { createRlmWorkflow } from "mastra-rlm-kit";

export const rlmWorkflow = createRlmWorkflow({
  workspace,
  models: {
    root: { id: "openrouter/moonshotai/kimi-k2.5" },
    sub: { id: "openrouter/minimax/minimax-m2.5" },
  },
  defaults: {
    budgets: {
      maxIterations: 30,
      maxCalls: 50,
      maxDepth: 1,
      maxOutputChars: 10_000,
    },
  },
});
```
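
Once registered, running it should follow Mastra's usual workflow pattern; a sketch assuming the current run API (`createRunAsync` / `start`) and an illustrative input shape, since the kit defines the real schema:

```ts
// Sketch only: the input fields below are illustrative; mastra-rlm-kit
// defines the workflow's actual input schema.
declare const documents: string[]; // your long-context material

const run = await rlmWorkflow.createRunAsync();
const result = await run.start({
  inputData: {
    task: "Analyze Q3 2024 renewable energy stocks vs. traditional energy.",
    context: documents,
  },
});
```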

---

## Where RLMs Actually Shine

| Use Case | Why RLM Helps |
|----------|---------------|
| Long‑context tasks | Break work across recursive calls instead of one window |
| Multi‑hop Q&A | Each hop is a traceable sub‑query |
| Math & logic | Python executes and verifies reasoning |
| Data analysis | Intermediate states are inspectable |
| Research synthesis | Parallel sub‑queries before synthesis |

If the task exceeds a single context window or requires verification, RLMs win.


---

## A Note on Benchmarks

`mastra-rlm-kit` includes strict, reproducible benchmarks — but they’re not the headline feature.

All benchmark runs:

- use datasets as‑is (no rewritten questions or labels)
- run the RLM loop without prompt tuning
- score outputs using official exact‑match metrics
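
For context on how unforgiving that metric is: the official scorers ship with each benchmark, but exact match generically boils down to something like this:

```ts
// Generic exact-match scoring; the official OolongBench scorer may
// normalize differently, this is just the standard idea.
function exactMatch(prediction: string, gold: string): boolean {
  const norm = (s: string) => s.toLowerCase().trim().replace(/\s+/g, " ");
  return norm(prediction) === norm(gold);
}

const accuracy = (preds: string[], golds: string[]): number =>
  preds.filter((p, i) => exactMatch(p, golds[i])).length / preds.length;
```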

### Current Results (OolongBench)

On a recent OolongBench validation slice:

- Accuracy: 20% (exact match)
- Completion rate: 100%
- Avg. sub‑queries: ~8 per task

Many failures are near‑misses (off‑by‑one values, partial lists, non‑canonical names), which are not counted as correct by design.


### Why This Is Still Useful

These results aren’t about leaderboard performance. They show that RLMs:

- execute multi‑step reasoning reliably
- fail deterministically (no silent hallucinations)
- produce full traces you can inspect and improve

Full benchmark commands and reports live in the repository.


---

## How It Works

(Further implementation details, architecture diagrams, and runtime flow are described in the repository README and documentation.)


### Internally

- **Root model receives the task**  
- It writes **Python REPL steps**  
- Steps execute and store intermediate results  
- Missing info → spawn `llm_query()` sub‑queries  
- Sub‑queries batch and parallelize  
- Results aggregate into a final synthesis  
- Full trace is persisted  

Every claim is either  

- **executed code**, or  
- **traceable recursive output**  

That’s how hallucinations die.  

---  

## Why Mastra Was the Right Fit  

Mastra already gets the fundamentals right:  

- TypeScript‑first  
- Built‑in observability  
- Clean workflow primitives  
- Model‑agnostic via Vercel AI SDK  

RLMs don’t replace Mastra — they **complete it**.  

---  

## Final Thought  

The gap between *agents that talk* and *agents that think* is still massive.  

Most demos fall apart the moment you ask for:  

- long‑context reasoning  
- verification  
- decomposition  
- accountability  

`mastra-rlm-kit` doesn’t add magic.  
It adds **structure, execution, and transparency**.  

Try it. Break it. Improve it.  

And tell me what you build.  

— Built by **[@metasurfero](https://dev.to/metasurfero)**