Databricks built a RAG agent it says can handle every kind of enterprise search
Source: VentureBeat
Most enterprise RAG pipelines are optimized for a single search behavior, and they fail silently on the others. A model trained to synthesize cross‑document reports handles constraint‑driven entity search poorly; a model tuned for simple lookup tasks falls apart on multi‑step reasoning over internal notes. Most teams only find out when something breaks.
KARL: Knowledge Agents via Reinforcement Learning
Databricks set out to fix that with KARL, short for Knowledge Agents via Reinforcement Learning. The company trained an agent across six distinct enterprise search behaviors simultaneously using a new reinforcement‑learning algorithm.
- Result: a model that matches Claude Opus 4.6 on a purpose‑built benchmark at 33% lower cost per query and 47% lower latency.
- Training data: entirely synthetic, generated by the agent itself with no human labeling required.
- Evaluation: based on KARLBench, a benchmark Databricks built to evaluate enterprise search behaviors.
“A lot of the big reinforcement learning wins that we’ve seen in the community in the past year have been on verifiable tasks where there is a right and a wrong answer,”
— Jonathan Frankle, Chief AI Scientist at Databricks, in an exclusive VentureBeat interview. “The tasks that we’re working on for KARL, and that are just normal for most enterprises, are not strictly verifiable in that same way.”
Representative Enterprise Tasks
- Synthesizing intelligence across product‑manager meeting notes.
- Reconstructing competitive deal outcomes from fragmented customer records.
- Answering questions about account history where no single document contains the full answer.
- Generating battle cards from unstructured internal data.
“Doing reinforcement learning in a world where you don’t have a strict right and wrong answer, and figuring out how to guide the process and make sure reward hacking doesn’t happen — that’s really non‑trivial,” Frankle said.
“Very little of what companies do day to day on knowledge tasks are verifiable.”
The Generalization Trap in Enterprise RAG
Standard RAG breaks down on ambiguous, multi‑step queries that draw on fragmented internal data never designed to be queried.
KARLBench Benchmark
To evaluate KARL, Databricks built KARLBench, measuring performance across six enterprise search behaviors:
- Constraint‑driven entity search
- Cross‑document report synthesis
- Long‑document traversal with tabular numerical reasoning
- Exhaustive entity retrieval
- Procedural reasoning over technical documentation
- Fact aggregation over internal company notes (the PMBench task, built from Databricks’ own product‑manager meeting notes)
Training on any single task and testing on the others yields poor results. The KARL paper shows that multi‑task RL generalizes in ways single‑task training does not: the team trained KARL on synthetic data for two of the six tasks and found it performed well on all four tasks it had never seen.
Example: To build a competitive battle card for a financial‑services customer, the agent must identify relevant accounts, filter for recency, reconstruct past competitive deals, and infer outcomes—none of which is labeled anywhere in the data.
Frankle calls what KARL does “grounded reasoning”: running a difficult reasoning chain while anchoring every step in retrieved facts.
“You can think of this as RAG, but like RAG++, all the way up to 200 vector‑database calls.”
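To make that loop concrete, here is a minimal, hypothetical sketch of grounded reasoning: each step issues a retrieval call, anchors the next reasoning step in the returned facts, and stops once no call surfaces anything new. The keyword-overlap "vector store", the scoring, and the stopping rule are toy stand-ins, not KARL's implementation.

```python
# Toy "grounded reasoning" loop: every step is anchored to retrieved documents,
# and retrieved facts refine the next query. Illustrative only.

def search(store, query_terms, k=2):
    """Stand-in for vector search: rank docs by term overlap with the query."""
    scored = sorted(store.items(),
                    key=lambda kv: -len(set(kv[1].split()) & set(query_terms)))
    return [doc_id for doc_id, _ in scored[:k]]

def grounded_answer(store, question_terms, max_calls=200):
    evidence, seen = [], set()
    terms = set(question_terms)
    for _ in range(max_calls):
        hits = [h for h in search(store, terms) if h not in seen]
        if not hits:                        # nothing new: commit to an answer
            break
        for h in hits:
            seen.add(h)
            evidence.append(h)
            terms |= set(store[h].split())  # fold retrieved facts into the query
    return evidence

store = {
    "acct_1": "acme renewed contract q3",
    "acct_2": "acme competitor lost deal q3",
    "note_9": "unrelated roadmap planning memo",
}
trace = grounded_answer(store, ["acme", "deal"])
```

The cap of 200 calls mirrors the scale the quote describes; a real agent would score with embeddings rather than term overlap.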
The RL Engine: Why OAPL Matters
KARL’s training is powered by OAPL (Optimal Advantage‑based Policy Optimization with Lagged Inference). It was developed jointly by researchers from Cornell, Databricks, and Harvard and published in a separate paper the week before KARL.
- Standard LLM RL uses on‑policy algorithms like GRPO (Group Relative Policy Optimization), which assume the model generating training data and the model being updated are in sync. In distributed training they never are.
- Prior fixes (importance sampling) introduced variance and instability.
- OAPL embraces the off‑policy nature of distributed training, using a regression objective that stays stable with policy lags of > 400 gradient steps—about 100 × more off‑policy than prior approaches handled.
In code‑generation experiments, OAPL matched a GRPO‑trained model while using roughly a third of the training samples.
Sample Efficiency
OAPL’s efficiency kept the training budget accessible: reusing previously collected rollouts rather than requiring fresh on‑policy data for every update meant the full KARL training run stayed within a few thousand GPU hours—the difference between a research project and something an enterprise team can realistically attempt.
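A hedged sketch of the rollout accounting behind that claim: refreshing the rollout buffer only every few updates, instead of generating fresh on-policy data for every update, divides the number of rollouts that must be generated. The batch sizes, reuse window, and no-op update are illustrative numbers, not KARL's training configuration.

```python
# Count rollouts generated under fresh-per-update vs. periodic-refresh training.

def train(num_updates, rollouts_per_batch, reuse_window):
    generated = 0
    buffer = []
    for step in range(num_updates):
        if step % reuse_window == 0:   # only refresh the buffer periodically
            buffer = [f"rollout_{generated + i}" for i in range(rollouts_per_batch)]
            generated += rollouts_per_batch
        _ = buffer                     # placeholder for one gradient update
    return generated

fresh = train(num_updates=12, rollouts_per_batch=8, reuse_window=1)   # on-policy
reused = train(num_updates=12, rollouts_per_batch=8, reuse_window=4)  # off-policy reuse
```

With a reuse window of 4, the same 12 updates consume a quarter of the rollouts, which is where the GPU-hour savings come from when rollout generation dominates cost.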
Agents, Memory, and the Context Stack
There has been a lot of discussion recently about replacing RAG with contextual memory (sometimes called agentic memory).
- Frankle’s view: not an either/or debate, but a layered stack.
- Base layer: a vector database with millions of entries (too large for direct context).
- Top layer: the LLM’s context window.
- Middle layers: compression and caching mechanisms that decide how much of what an agent has already learned can be carried forward.
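The layered stack above can be sketched as a small data structure: a large base store, a bounded in-context window on top, and a middle layer that compresses and carries forward what the agent has already seen. The class name, window size, and first-word "summary" rule are assumptions for illustration, not Databricks' design.

```python
# Toy three-layer context stack: base store -> compression cache -> context window.

class ContextStack:
    def __init__(self, store, window_budget=4):
        self.store = store    # base layer: full corpus, too large for context
        self.cache = []       # middle layer: compressed carry-forward
        self.window = []      # top layer: what the LLM actually sees
        self.budget = window_budget

    def retrieve(self, doc_id):
        self.window.append(self.store[doc_id])
        if len(self.window) > self.budget:
            self._compress()

    def _compress(self):
        # Stand-in compression: keep the first word of each doc as a "summary".
        summary = " ".join(doc.split()[0] for doc in self.window)
        self.cache.append(summary)
        self.window = [summary]   # carry the summary forward in-context

store = {i: f"doc{i} body text" for i in range(6)}
stack = ContextStack(store)
for i in range(6):
    stack.retrieve(i)
```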
KARL in Practice
Some KARLBench tasks required 200 sequential vector‑database queries, with the agent refining searches, verifying details, and cross‑referencing documents before committing to an answer—exhausting the context window many times over.
Instead of training a separate summarization model, the team let KARL learn compression end‑to‑end through RL: when context grew too large, the agent compressed it and continued, with the only training signal being the reward at the end of the task.
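The training signal described above can be sketched as follows: compression is just another action inside the episode, and every step, including each compression decision, is credited only with the terminal task reward. The episode structure, actions, and failure condition are toy stand-ins for illustration.

```python
# End-to-end credit assignment: no per-step supervision for compression;
# the terminal reward is broadcast to every action in the trajectory.

def run_episode(actions, context_limit=3):
    context, trajectory = [], []
    for step, act in enumerate(actions):
        if act == "compress":
            context = [f"summary({len(context)} items)"]
        else:
            context.append(f"fact_{step}")
        trajectory.append((act, len(context)))
        if len(context) > context_limit:   # context overflow: episode fails
            return trajectory, 0.0
    return trajectory, 1.0                 # task completed: terminal reward

traj, reward = run_episode(["retrieve", "retrieve", "compress", "retrieve"])
credited = [(act, reward) for act, _ in traj]   # same reward at every step
```

An RL learner updating on `credited` would reinforce the mid-episode compression only because it let the episode finish, which is the end-to-end property the quote describes.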
KARL: Performance Highlights & Limitations
“We just let the model figure out how to compress its own context,” Frankle said. “And this worked phenomenally well.”
Where KARL Falls Short
- Ambiguity handling – KARL struggles most with questions that have significant ambiguity, where multiple valid answers exist and the model cannot determine whether the query is genuinely open‑ended or simply hard to answer. This judgment call remains an unsolved problem.
- Early termination – The model sometimes “gives up” early, stopping before producing a final answer. Frankle cautions against labeling this a failure, noting that the most expensive queries are typically the ones the model gets wrong anyway; stopping can often be the right call.
- Scope of training – KARL was trained and evaluated exclusively on vector search. Tasks that require SQL queries, file search, or Python‑based calculations are not yet supported. Frankle indicated that these capabilities are on the roadmap but are not part of the current system.
What This Means for Enterprise Data Teams
KARL surfaces three decisions worth revisiting for teams evaluating their retrieval infrastructure.
1. Pipeline Architecture
- If your RAG (Retrieval‑Augmented Generation) agent is optimized for a single search behavior, KARL’s results suggest it will fail on others.
- Multi‑task training across diverse retrieval behaviors produces models that generalize; narrow pipelines do not.
2. Why Reinforcement Learning (RL) Matters
- Databricks tested an alternative: distilling from expert models via supervised fine‑tuning.
- This improved in‑distribution performance but yielded negligible gains on unseen tasks.
- RL developed general search behaviors that transferred to new tasks.
- For enterprise teams facing heterogeneous data and unpredictable query types, this distinction is the whole game.
3. Interpreting RL Efficiency in Practice
- A model trained to search better:
  - Completes tasks in fewer steps.
  - Stops earlier on queries it cannot answer.
  - Diversifies its search rather than repeating failed queries.
  - Compresses its own context instead of running out of room.
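Two of those learned behaviors, diversifying instead of repeating failed queries and terminating early, can be sketched in one toy loop. The retrieval outcome and the query-rewrite rule are hypothetical; the point is only the control flow.

```python
# Toy search loop: never repeat a failed query; halt once rewrites stop
# making progress rather than burning the full step budget.

def run_search(answerable, max_steps=10):
    tried, steps = set(), 0
    query = "q0"
    for steps in range(1, max_steps + 1):
        if query in tried:            # diversification: never repeat a failure
            break
        tried.add(query)
        if answerable and steps >= 2:
            return "answer", steps    # task solved in a few steps
        query = f"q{steps}" if steps < 3 else "q0"  # rewrites run out, loop back
    return "give_up", steps           # early termination, not a failure mode

ok, ok_steps = run_search(answerable=True)
bad, bad_steps = run_search(answerable=False)
```

On the unanswerable query the loop gives up well under the step budget, which is the "stopping can be the right call" behavior Frankle describes.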
The argument for training purpose‑built search agents—rather than routing everything through general‑purpose frontier APIs—is not primarily about cost. It is about building a model that knows how to do the job.