Building an MCP Server for AI-Native Data Discovery: Rust Crates Ecosystem: Part I
Introduction
Traditional data exploration relies on predefined queries—SQL code, dashboards, BI tools.
What if you could explore a data warehouse conversationally, asking open‑ended questions and letting an AI discover patterns you never thought to look for?
The Model Context Protocol (MCP) makes this possible. I built an MCP server for analyzing the Rust ecosystem and let Claude explore it, answering questions about crates, dependencies, and trends.
If you want to follow along, see the Rust Crates Analytics Repo for the full project.
Understanding the crates.io Data
When you download and extract the crates.io DB dump (see the download link in the README), you receive:
- A PostgreSQL dump with instructions for loading it locally.
- A `data` folder containing CSV files that represent the actual contents.
Entity Overview
| Entity | Description |
|---|---|
| `crates` | Rust packages published to crates.io |
| `versions` | Specific releases of a crate (e.g., serde v1.0.228) |
| `categories` | Taxonomic classifications (e.g., science::bioinformatics) |
| `keywords` | User‑defined tags for discoverability (e.g., cargo, sql) |
| `teams` | Organizational accounts that can own crates (GitHub) |
| `users` | Individual developer accounts (GitHub) |
Fact Tables
- `version_downloads` – Time‑series download counts per version per day (last 3 months only).
- `crate_downloads` – All‑time total download counts per crate.
Junction Tables
- `crates_categories` – Links crates to categories.
- `crate_owners` – Links crates to users or teams.
- `crates_keywords` – Links crates to keywords.
- `dependencies` – Links versions to the crates they depend on (example query below).
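To illustrate how the `dependencies` junction table ties the model together, a reverse‑dependency count can be sketched roughly like this (assuming the dump's columns, where `dependencies.version_id` is the depending version and `dependencies.crate_id` is the crate being depended on):

```sql
-- Crates most often depended on, counted by distinct depending crates.
SELECT c.name,
       COUNT(DISTINCT v.crate_id) AS dependent_crates
FROM dependencies d
JOIN versions v ON v.id = d.version_id   -- the version that declares the dependency
JOIN crates   c ON c.id = d.crate_id     -- the crate being depended on
GROUP BY c.name
ORDER BY dependent_crates DESC
LIMIT 10;
```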
Support Tables
- `metadata` – Contains a single row with `total_downloads`.
- `reserved_crate_names` – List of protected/unavailable crate names.
- `default_versions` – Links crates to their default version.
Historical Data
- The `version_downloads` table only holds the most recent three months of data.
- Older daily CSV dumps (available on the crates.io site) contain `version_id` and `downloads` columns; the date must be inferred from the filename.
- All other tables represent the state of the ecosystem at the moment of the dump (e.g., `crates.created_at`, `crates.updated_at`).
Key takeaways
- The DB dump is refreshed daily, reflecting the ecosystem at download time.
- Each day, one day’s worth of data is removed from the three‑month window and placed in the CSV archives.
- The most important tables for analytics are `crates`, `versions`, `version_downloads`, and `dependencies` (see the example query below).
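As a concrete illustration of how these tables fit together, here is a rough query that ranks crates by downloads within the three‑month window of the dump. The column names follow the dump schema described above (`version_downloads(version_id, date, downloads)`, `versions(id, crate_id)`, `crates(id, name)`); treat it as a sketch rather than copied project code.

```sql
-- Top 10 crates by downloads across the dump's three-month window.
SELECT c.name,
       SUM(vd.downloads) AS downloads_last_3_months
FROM version_downloads vd
JOIN versions v ON v.id = vd.version_id
JOIN crates   c ON c.id = v.crate_id
GROUP BY c.name
ORDER BY downloads_last_3_months DESC
LIMIT 10;
```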
Architecture Overview
ELT Pipeline Design
| Layer | Purpose |
|---|---|
| raw | Direct CSV loads from crates.io dumps |
| staging | Cleaned and validated data (prefix stg_) |
| marts | Analytics‑ready tables (not covered in this post) |
Extract
- Download the crates.io DB dump (`tar.gz`) and extract the CSV files.
Load
- Import all CSVs into the `raw` schema using a full refresh strategy—each new dump replaces the raw tables, giving a clean snapshot of the current state (see the sketch below).
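A minimal DuckDB sketch of this full‑refresh load; the table list and file paths are illustrative, not the project's exact loader script:

```sql
-- Recreate each raw table from the freshly extracted CSV dump.
-- read_csv_auto infers column names and types from the file.
CREATE SCHEMA IF NOT EXISTS raw;

CREATE OR REPLACE TABLE raw.crates            AS SELECT * FROM read_csv_auto('data/crates.csv');
CREATE OR REPLACE TABLE raw.versions          AS SELECT * FROM read_csv_auto('data/versions.csv');
CREATE OR REPLACE TABLE raw.version_downloads AS SELECT * FROM read_csv_auto('data/version_downloads.csv');
CREATE OR REPLACE TABLE raw.dependencies      AS SELECT * FROM read_csv_auto('data/dependencies.csv');
```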
Transform
- Apply data‑quality rules in the `staging` schema:
  - Normalize all timestamps to UTC.
  - Incrementally load `stg_version_downloads` (add new dates only). The first run ingests all available dates; subsequent runs add only new data.
  - Full refresh on dimension tables (`categories`, `crates`, `versions`, …) to capture updates.
  - Enforce data contracts and run quality tests.
This approach efficiently handles the three‑month rolling window while preserving a complete historical archive in `stg_version_downloads`.
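In dbt terms, the incremental model can be sketched roughly like this (the file path, source, and column names are assumed from the descriptions above, not copied from the repo):

```sql
-- models/staging/stg_version_downloads.sql (illustrative path)
{{ config(materialized='incremental') }}

SELECT
    version_id,
    date,
    downloads
FROM {{ source('raw', 'version_downloads') }}

{% if is_incremental() %}
-- Subsequent runs only append dates newer than what is already loaded.
WHERE date > (SELECT MAX(date) FROM {{ this }})
{% endif %}
```

On the first run the table does not exist yet, so the `is_incremental()` block is skipped and every available date is loaded; later runs append only new dates, matching the behaviour described above.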
Backfilling Historical Downloads
- Historical archives date back to 2014‑11‑11.
- A backfill script (parameterized by start/end dates) ingests older CSVs directly into `stg_version_downloads` (see the sketch after this list).
- Snapshots of `crates`, `dependencies`, and `versions` are also taken to capture changes between dumps.
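A DuckDB‑flavoured sketch of the backfill idea (the real script is Python and parameterized by dates; the archive path and filename layout here are assumptions):

```sql
-- Each archived CSV contains only version_id and downloads; the date lives in
-- the filename (e.g. archive/2015-01-01/version_downloads.csv in this sketch).
INSERT INTO stg_version_downloads (version_id, date, downloads)
SELECT
    version_id,
    CAST(regexp_extract(filename, '(\d{4}-\d{2}-\d{2})', 1) AS DATE) AS date,
    downloads
FROM read_csv_auto('archive/*/version_downloads.csv', filename = true);
```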
Constraints & Technology Stack
| Constraint | Solution |
|---|---|
| No infrastructure overhead | DuckDB – embedded, single‑file OLAP database |
| Cross‑platform | Python + uv (fast package manager) |
| Fast iteration | dbt for SQL transformations, testing, and snapshots |
| Storage efficient | DuckDB file ~10 GB for data from 2014‑11‑11 to 2025‑11‑29; runs on a laptop (~8 GB RAM, ~20 GB free disk) |
| Visualization | Streamlit for quick data validation dashboards |
All project dependencies are managed via uv; the only prerequisite is having uv installed.
Investigating Orphan Versions
One interesting data‑quality check involved orphan versions in the `stg_version_downloads` table—entries where the `version_id` does not exist in the `versions` table.
The investigation steps:
- Identify orphan rows:

  ```sql
  SELECT vd.version_id
  FROM stg_version_downloads vd
  LEFT JOIN stg_versions v ON vd.version_id = v.id
  WHERE v.id IS NULL
  LIMIT 100;
  ```

- Quantify the issue:

  ```sql
  SELECT COUNT(*) AS orphan_count
  FROM stg_version_downloads vd
  LEFT JOIN stg_versions v ON vd.version_id = v.id
  WHERE v.id IS NULL;
  ```

- Root cause analysis:
  - Orphans often arise from crates that were yanked or deleted after the download snapshot.
  - The three‑month window may contain download records for versions that no longer exist in the current `versions` table.

- Resolution strategy:
  - Keep orphan rows for historical completeness, but flag them in downstream analyses.
  - Optionally, maintain a separate “historical versions” table that stores metadata for yanked/deleted versions.
This check became part of the automated dbt tests, ensuring future loads surface orphan versions automatically.
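A singular dbt test along these lines would surface new orphans on every run; the file name is hypothetical, and dbt fails the test whenever the query returns rows:

```sql
-- tests/assert_no_unexpected_orphan_version_downloads.sql (hypothetical file name)
-- Any returned row is reported as a test failure, flagging version_ids in the
-- download facts that have no matching row in the staged versions table.
SELECT
    vd.version_id,
    COUNT(*) AS orphan_rows
FROM {{ ref('stg_version_downloads') }} vd
LEFT JOIN {{ ref('stg_versions') }} v
    ON vd.version_id = v.id
WHERE v.id IS NULL
GROUP BY vd.version_id
```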