SQL query logs hold the context AI agents need to stop hallucinating joins

Published: May 28, 2026 at 11:00 AM EDT
5 min read

Source: VentureBeat

Miro’s AI Agent Struggle with Snowflake

When Miro’s data team pointed AI agents directly at its Snowflake environment, the agents got the wrong answer more than 65% of the time. The problem wasn’t the model; it was context. With more than 10,000 tables and no semantic layer to guide routing, the agents had no way to know which data assets matched which business questions.

On Thursday, DataHub is releasing a context layer that mines existing SQL query history to build a semantic index and exposes it to agents via MCP, LangChain, Google’s Agent Development Kit, and CrewAI. The company calls it Context Intelligence, and it is built on the same query‑log infrastructure DataHub has used for lineage tracking in production deployments worldwide.

Who’s Behind DataHub?

  • Founders: The team that built DataHub as an open‑source project at LinkedIn.
  • Co‑founder & CTO: Shirshanka Das, who led LinkedIn’s data infrastructure for nearly 11 years.
  • Open‑source impact: >15,000 contributors and >3,000 production deployments worldwide.

“For the first time, enterprises can turn years of analyst query history into a living, retrievable knowledge base where agents stop hallucinating joins because they have access to the joins that have worked before, validated by the people who ran them,” – Shirshanka Das, co‑founder and CTO of DataHub (VentureBeat exclusive).

Why Query History Beats Raw Schema for Agent Routing

DataHub began as a metadata‑management project at LinkedIn, built to solve two problems simultaneously:

  1. Make data easy to find and use across the organization.
  2. Ensure data is used for the right reasons (governance, compliance).

Das open‑sourced the project in early 2020 after nearly six years of internal development.

Primary Use Cases Since Launch

  • Lineage: Understanding how data flows from operational systems → streaming infrastructure → warehouses → business tools.
  • Regulatory compliance audits
  • Operational triage
  • New engineer onboarding

Postgres is the most‑connected source across the global DataHub deployment base, followed by MySQL, Oracle, and major cloud warehouses (Snowflake, Google BigQuery). The platform now supports more than 100 connected metadata sources.

The Release Context

The query‑log extraction and SQL‑parsing capabilities powering Context Intelligence were developed across years of production deployment, not built just for this release. The same infrastructure now serves agents querying a semantic index at runtime.

“The consumption layer has changed from humans to agents,” – Das.

Context Intelligence: Mining Validated Query History, Not Raw Logs

What It Is

  • A new capability layer built on top of DataHub’s existing open‑source metadata foundation.
  • Leverages years‑old infrastructure that extracts and parses query logs for lineage tracking.
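
As a rough illustration of what extracting lineage from query logs can look like, the sketch below derives source-to-target table edges from logged SQL and walks them upstream. It is a toy under stated assumptions, not DataHub's parser: the regular expressions, the `lineage_edges` and `upstream` helpers, and the sample queries are all hypothetical, and a production system would use a full SQL parser rather than pattern matching.

```python
import re
from collections import defaultdict

# Naive patterns (assumptions for this sketch): a write target follows
# INSERT INTO or CREATE TABLE, and read sources follow FROM or JOIN.
TARGET_RE = re.compile(
    r"\b(?:INSERT\s+INTO|CREATE\s+(?:OR\s+REPLACE\s+)?TABLE)\s+([\w.]+)",
    re.IGNORECASE,
)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def lineage_edges(sql):
    """Return (source_table, target_table) pairs for one logged statement."""
    target = TARGET_RE.search(sql)
    if target is None:
        return set()  # read-only query: no table is written, so no edge
    tgt = target.group(1).lower()
    return {(src.lower(), tgt) for src in SOURCE_RE.findall(sql) if src.lower() != tgt}


def upstream(table, edges):
    """Collect every table that feeds `table`, directly or transitively."""
    parents = defaultdict(set)
    for src, tgt in edges:
        parents[tgt].add(src)
    seen, stack = set(), [table]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical query-log contents.
QUERIES = [
    "INSERT INTO staging.orders SELECT * FROM raw.orders",
    "CREATE TABLE marts.revenue AS SELECT u.region, SUM(o.amount) "
    "FROM staging.orders o JOIN raw.users u ON o.user_id = u.id GROUP BY u.region",
    "SELECT * FROM marts.revenue",
]

edges = set()
for query in QUERIES:
    edges |= lineage_edges(query)

print(sorted(upstream("marts.revenue", edges)))
```

The same parsed edges that answer "what feeds this table?" for a human doing impact analysis are the raw material a context layer can later hand to an agent.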

How It Works

| Step | Description |
| --- | --- |
| 1️⃣ Filtering for signal | Warehouse query logs contain a lot of noise. DataHub filters for “golden queries”: high‑quality analyst queries and scheduled pipelines that represent proven business logic. |
| 2️⃣ Inverting SQL into semantic definitions | Patterns from golden queries are translated into semantic anchors (structured text definitions). These anchors become the retrieval basis agents draw on before generating SQL. |
| 3️⃣ Human validation on top | Context Hub lets domain experts review AI‑proposed context, resolve conflicting definitions, and simulate impact before publishing. DataHub surfaces cases where different teams calculate the same metric differently and raises them for human resolution. |
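
The three steps above can be sketched in a few lines of Python. This is an illustrative toy, not DataHub's implementation: the `QueryLogEntry` fields, the `min_runs` threshold for what counts as "golden", the regex-based join extraction, and the anchor text format are all assumptions made for the example. The conflict check here flags table pairs joined on different keys, which stands in for the metric-definition conflicts the article describes.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryLogEntry:
    """Hypothetical shape of one warehouse query-log record."""
    sql: str
    run_count: int           # times this query text was executed
    succeeded: bool
    scheduled: bool = False  # True for pipeline jobs


# Naive matcher for "JOIN <table> ON <t1.col> = <t2.col>". Assumes no table
# aliases and a single equality condition; real SQL needs a proper parser.
JOIN_RE = re.compile(r"\bJOIN\s+[\w.]+\s+ON\s+([\w.]+)\s*=\s*([\w.]+)", re.IGNORECASE)


def is_golden(entry, min_runs=5):
    """Step 1: keep successful queries that are scheduled or frequently run."""
    return entry.succeeded and (entry.scheduled or entry.run_count >= min_runs)


def extract_joins(sql):
    """Return each join condition as a sorted (column, column) pair."""
    return [tuple(sorted(c.lower() for c in m.groups())) for m in JOIN_RE.finditer(sql)]


def table_of(column):
    """'orders.user_id' -> 'orders'."""
    return column.rsplit(".", 1)[0]


def build_anchors(log, min_runs=5):
    """Step 2: count how many golden queries use each join condition."""
    counts = Counter()
    for entry in log:
        if is_golden(entry, min_runs):
            counts.update(set(extract_joins(entry.sql)))
    return counts


def anchor_text(join, seen):
    """Render one join pattern as a plain-text definition an agent could retrieve."""
    left, right = join
    return (f"Join {table_of(left)} to {table_of(right)} "
            f"on {left} = {right} (seen in {seen} golden queries)")


def find_conflicts(counts):
    """Step 3: table pairs joined on more than one key, queued for human review."""
    by_pair = {}
    for left, right in counts:
        by_pair.setdefault((table_of(left), table_of(right)), []).append((left, right))
    return {pair: sorted(joins) for pair, joins in by_pair.items() if len(joins) > 1}


# Hypothetical log: two golden queries agree, one disagrees, two are noise.
LOG = [
    QueryLogEntry("SELECT * FROM users JOIN orders ON orders.user_id = users.id", 40, True),
    QueryLogEntry("SELECT COUNT(*) FROM orders JOIN users ON users.id = orders.user_id",
                  1, True, scheduled=True),
    QueryLogEntry("SELECT * FROM users JOIN orders ON orders.customer_id = users.id", 12, True),
    QueryLogEntry("SELECT * FROM users JOIN orders ON orders.id = users.id", 1, True),
    QueryLogEntry("SELECT * FROM users JOIN events ON events.uid = users.id", 9, False),
]

anchors = build_anchors(LOG)
for join, seen in anchors.most_common():
    print(anchor_text(join, seen))
print("Needs review:", find_conflicts(anchors))
```

The one-off query and the failed query never reach the index, while the two competing join keys for `orders` and `users` are surfaced rather than silently merged, which is the point at which a domain expert would step in.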

“You can almost think of it as inverting text to SQL,” – Das.

How Miro Got AI Agents Working Across 10,000 Snowflake Tables

  • Background: Miro already used DataHub for lineage tracking and impact analysis.
  • Problem: Direct natural‑language queries to Snowflake’s MCP produced incorrect answers more than 65% of the time. Exposing more than 10,000 tables directly to agents caused massive routing confusion.

Solution

  1. Organize data into well‑defined data products that constrain what agents can see, rather than exposing raw schema.
  2. Production architecture:
    • User requests → Claude Chat / Claude Cowork → Context layer (DataHub’s MCP maps natural language to appropriate data assets) → Snowflake MCP for SQL generation.

“The context layer pulls in metadata, entity relationships, query history and business intent for each Snowflake table, specifically what business question each entity is designed to answer,” – Ronald Angel, Product Manager, Data Platform, Miro.

These semantic signals let the agent identify the correct database entities before writing SQL, eliminating guesswork from schema alone.
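
The routing idea can be illustrated with a small, self-contained sketch: match the question against the business question each data product is designed to answer, and expose only the matching tables to the SQL-generation step. Everything here is hypothetical, including the `DataProduct` type, the token-overlap scoring, and the sample products and table names. It does not call DataHub's or Snowflake's MCP servers, and a real context layer would draw on richer signals such as query history and entity relationships.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical data product: a curated set of tables with a stated purpose."""
    name: str
    business_question: str  # what the product is designed to answer
    tables: tuple


# Small stopword list so that filler words do not count as matches.
STOPWORDS = {"what", "was", "is", "the", "a", "by", "and", "or",
             "how", "many", "each", "of", "for", "in", "last"}


def tokens(text):
    """Lowercase word set with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def route(question, products, top_k=1):
    """Rank products by word overlap with the question; drop zero-overlap products."""
    asked = tokens(question)
    scored = [(len(asked & tokens(p.business_question)), p) for p in products]
    scored = [(score, p) for score, p in scored if score > 0]
    scored.sort(key=lambda item: (-item[0], item[1].name))
    return [p for _, p in scored[:top_k]]


def allowed_tables(question, products, top_k=1):
    """Tables the SQL-generating agent is allowed to see for this question."""
    return sorted({t for p in route(question, products, top_k) for t in p.tables})


# Hypothetical catalog of two data products.
PRODUCTS = [
    DataProduct(
        "board_engagement",
        "How many active users edited or viewed boards each week?",
        ("analytics.board_events", "analytics.users"),
    ),
    DataProduct(
        "billing_revenue",
        "What is monthly recurring revenue by plan and region?",
        ("finance.subscriptions", "finance.invoices"),
    ),
]

question = "What was monthly revenue by region last quarter?"
print([p.name for p in route(question, PRODUCTS)])
print(allowed_tables(question, PRODUCTS))
```

The design choice worth noting is that the agent never sees the full schema: an unmatched question returns an empty table list, so the agent gets nothing to guess from.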

Where DataHub Fits in the Wider Context Stack

| Vendor / Platform | Offering | Relationship to DataHub |
| --- | --- | --- |
| Pinecone | Vector store with contextual memory | DataHub can feed semantic anchors into Pinecone for retrieval. |
| Oracle | Database + AI services | DataHub can enrich Oracle metadata with query‑history context. |
| Redis | In‑memory store with vector capabilities | Acts as a fast cache for semantic anchors. |
| Microsoft Fabric IQ | Semantic layer for context | DataHub positions its layer as platform‑neutral, provisioning context into existing endpoints like Fabric IQ rather than replacing them. |

“A lot of times people want to be platform neutral when it comes to their context layer,” – Das.

Kevin Petrie, analyst at BARC, told VentureBeat that he sees DataHub’s ability to integrate diverse metadata for both structured and unstructured data as a key differentiator in the emerging context‑intelligence market.

Context‑Driven Data Management

“Many other vendors are more focused on structured tables, which provide trusted facts but often lack the rich context of text objects,” said Michael Ni, VP and principal analyst at Constellation Research.

Ni highlighted that DataHub’s context layer shifts from passive cataloging to continuously refreshed semantic intelligence. He argued that whoever controls context at runtime controls the decision layer for data, agents, workflows, and decisions.

“Buyers need to be careful, since many vendors only support a portion of the full context capabilities required for AI and agentic solutions,” Ni said. “Buyers should be clear on their context management requirements, as vector memory isn’t business meaning, business meaning isn’t governance, and governance isn’t execution.”
