Building an MCP Server for AI-Native Data Discovery: Rust Crates Ecosystem: Part I
Introduction
Traditional data exploration relies on predefined queries—SQL code, dashboards, BI tools.
What if you could explore a data warehouse conversationally, asking open‑ended questions and letting an AI discover patterns you never thought to look for?
The Model Context Protocol (MCP) makes this possible. I built an MCP server for analyzing the Rust ecosystem and let Claude explore it, answering questions about crates, dependencies, and trends.
If you want to follow along, see the Rust Crates Analytics Repo for the full project.
Understanding the crates.io Data
When you download and extract the crates.io DB dump (see the download link in the README), you receive:
- A PostgreSQL dump with instructions for loading it locally.
- A `data` folder containing CSV files that represent the actual contents.
Entity Overview
| Entity | Description |
|---|---|
| `crates` | Rust packages published to crates.io |
| `versions` | Specific releases of a crate (e.g., serde v1.0.228) |
| `categories` | Taxonomic classifications (e.g., science::bioinformatics) |
| `keywords` | User‑defined tags for discoverability (e.g., cargo, sql) |
| `teams` | Organizational accounts that can own crates (GitHub) |
| `users` | Individual developer accounts (GitHub) |
Fact Tables
- `version_downloads` – Time‑series download counts per version per day (last 3 months only).
- `crate_downloads` – All‑time total download counts per crate.
Junction Tables
- `crates_categories` – Links crates to categories.
- `crate_owners` – Links crates to users or teams.
- `crates_keywords` – Links crates to keywords.
- `dependencies` – Links versions to the crates they depend on (example query below).
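To illustrate how the `dependencies` junction table ties the model together, a reverse‑dependency count can be sketched roughly like this (assuming the dump's columns, where `dependencies.version_id` is the depending version and `dependencies.crate_id` is the crate being depended on):

```sql
-- Crates most often depended on, counted by distinct depending crates.
SELECT c.name,
       COUNT(DISTINCT v.crate_id) AS dependent_crates
FROM dependencies d
JOIN versions v ON v.id = d.version_id   -- the version that declares the dependency
JOIN crates   c ON c.id = d.crate_id     -- the crate being depended on
GROUP BY c.name
ORDER BY dependent_crates DESC
LIMIT 10;
```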
Support Tables
- `metadata` – Contains a single row with `total_downloads`.
- `reserved_crate_names` – List of protected/unavailable crate names.
- `default_versions` – Links crates to their default version.
Historical Data
- The `version_downloads` table only holds the most recent three months of data.
- Older daily CSV dumps (available on the crates.io site) contain `version_id` and `downloads` columns; the date must be inferred from the filename.
- All other tables represent the state of the ecosystem at the moment of the dump (e.g., `crates.created_at`, `crates.updated_at`).
Key takeaways
- The DB dump is refreshed daily, reflecting the ecosystem at download time.
- Each day, one day’s worth of data is removed from the three‑month window and placed in the CSV archives.
- The most important tables for analytics are `crates`, `versions`, `version_downloads`, and `dependencies` (see the example query below).
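As a concrete illustration of how these tables fit together, here is a rough query that ranks crates by downloads within the three‑month window of the dump. The column names follow the dump schema described above (`version_downloads(version_id, date, downloads)`, `versions(id, crate_id)`, `crates(id, name)`); treat it as a sketch rather than copied project code.

```sql
-- Top 10 crates by downloads across the dump's three-month window.
SELECT c.name,
       SUM(vd.downloads) AS downloads_last_3_months
FROM version_downloads vd
JOIN versions v ON v.id = vd.version_id
JOIN crates   c ON c.id = v.crate_id
GROUP BY c.name
ORDER BY downloads_last_3_months DESC
LIMIT 10;
```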
Architecture Overview
ELT Pipeline Design
| Layer | Purpose |
|---|---|
| raw | Direct CSV loads from crates.io dumps |
| staging | Cleaned and validated data (prefix stg_) |
| marts | Analytics‑ready tables (not covered in this post) |
Extract
- Download the crates.io DB dump (`tar.gz`) and extract the CSV files.
Load
- Import all CSVs into the `raw` schema using a full refresh strategy—each new dump replaces the raw tables, giving a clean snapshot of the current state (see the sketch below).
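A minimal DuckDB sketch of this full‑refresh load; the table list and file paths are illustrative, not the project's exact loader script:

```sql
-- Recreate each raw table from the freshly extracted CSV dump.
-- read_csv_auto infers column names and types from the file.
CREATE SCHEMA IF NOT EXISTS raw;

CREATE OR REPLACE TABLE raw.crates            AS SELECT * FROM read_csv_auto('data/crates.csv');
CREATE OR REPLACE TABLE raw.versions          AS SELECT * FROM read_csv_auto('data/versions.csv');
CREATE OR REPLACE TABLE raw.version_downloads AS SELECT * FROM read_csv_auto('data/version_downloads.csv');
CREATE OR REPLACE TABLE raw.dependencies      AS SELECT * FROM read_csv_auto('data/dependencies.csv');
```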
Transform
- Apply data‑quality rules in the `staging` schema:
  - Normalize all timestamps to UTC.
  - Incrementally load `stg_version_downloads` (add new dates only). The first run ingests all available dates; subsequent runs add only new data.
  - Full refresh on dimension tables (`categories`, `crates`, `versions`, …) to capture updates.
  - Enforce data contracts and run quality tests.
This approach efficiently handles the three‑month rolling window while preserving a complete historical archive in `stg_version_downloads`.
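In dbt terms, the incremental model can be sketched roughly like this (the file path, source, and column names are assumed from the descriptions above, not copied from the repo):

```sql
-- models/staging/stg_version_downloads.sql (illustrative path)
{{ config(materialized='incremental') }}

SELECT
    version_id,
    date,
    downloads
FROM {{ source('raw', 'version_downloads') }}

{% if is_incremental() %}
-- Subsequent runs only append dates newer than what is already loaded.
WHERE date > (SELECT MAX(date) FROM {{ this }})
{% endif %}
```

On the first run the table does not exist yet, so the `is_incremental()` block is skipped and every available date is loaded; later runs append only new dates, matching the behaviour described above.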
Backfilling Historical Downloads
- Historical archives date back to 2014‑11‑11.
- A backfill script (parameterized by start/end dates) ingests older CSVs directly into `stg_version_downloads` (see the sketch after this list).
- Snapshots of `crates`, `dependencies`, and `versions` are also taken to capture changes between dumps.
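A DuckDB‑flavoured sketch of the backfill idea (the real script is Python and parameterized by dates; the archive path and filename layout here are assumptions):

```sql
-- Each archived CSV contains only version_id and downloads; the date lives in
-- the filename (e.g. archive/2015-01-01/version_downloads.csv in this sketch).
INSERT INTO stg_version_downloads (version_id, date, downloads)
SELECT
    version_id,
    CAST(regexp_extract(filename, '(\d{4}-\d{2}-\d{2})', 1) AS DATE) AS date,
    downloads
FROM read_csv_auto('archive/*/version_downloads.csv', filename = true);
```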
Constraints & Technology Stack
| Constraint | Solution |
|---|---|
| No infrastructure overhead | DuckDB – embedded, single‑file OLAP database |
| Cross‑platform | Python + uv (fast package manager) |
| Fast iteration | dbt for SQL transformations, testing, and snapshots |
| Storage efficient | DuckDB file ~10 GB for data from 2014‑11‑11 to 2025‑11‑29; runs on a laptop (~8 GB RAM, ~20 GB free disk) |
| Visualization | Streamlit for quick data validation dashboards |
All project dependencies are managed via uv; the only prerequisite is having uv installed.
Investigating Orphan Versions
One interesting data‑quality check involved orphan versions in the `stg_version_downloads` table—entries where the `version_id` does not exist in the `versions` table.
The investigation steps:
- Identify orphan rows:

  ```sql
  SELECT vd.version_id
  FROM stg_version_downloads vd
  LEFT JOIN stg_versions v ON vd.version_id = v.id
  WHERE v.id IS NULL
  LIMIT 100;
  ```

- Quantify the issue:

  ```sql
  SELECT COUNT(*) AS orphan_count
  FROM stg_version_downloads vd
  LEFT JOIN stg_versions v ON vd.version_id = v.id
  WHERE v.id IS NULL;
  ```

- Root cause analysis:
  - Orphans often arise from crates that were yanked or deleted after the download snapshot.
  - The three‑month window may contain download records for versions that no longer exist in the current `versions` table.

- Resolution strategy:
  - Keep orphan rows for historical completeness, but flag them in downstream analyses.
  - Optionally, maintain a separate “historical versions” table that stores metadata for yanked/deleted versions.
This check became part of the automated dbt tests, ensuring future loads surface orphan versions automatically.
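A singular dbt test along these lines would surface new orphans on every run; the file name is hypothetical, and dbt fails the test whenever the query returns rows:

```sql
-- tests/assert_no_unexpected_orphan_version_downloads.sql (hypothetical file name)
-- Any returned row is reported as a test failure, flagging version_ids in the
-- download facts that have no matching row in the staged versions table.
SELECT
    vd.version_id,
    COUNT(*) AS orphan_rows
FROM {{ ref('stg_version_downloads') }} vd
LEFT JOIN {{ ref('stg_versions') }} v
    ON vd.version_id = v.id
WHERE v.id IS NULL
GROUP BY vd.version_id
```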