Building an MCP Server for AI-Native Data Discovery: Rust Crates Ecosystem: Part I

Published: December 4, 2025 at 06:53 AM EST
4 min read
Source: Dev.to

Introduction

Traditional data exploration relies on predefined artifacts: SQL queries, dashboards, BI tools.
What if you could explore a data warehouse conversationally, asking open‑ended questions and letting an AI discover patterns you never thought to look for?

The Model Context Protocol (MCP) makes this possible. I built an MCP server for analyzing the Rust ecosystem and let Claude explore it, answering questions about crates, dependencies, and trends.

If you want to follow along, see the Rust Crates Analytics Repo for the full project.

Understanding the crates.io Data

When you download and extract the crates.io DB dump (see the download link in the README), you receive:

  • A PostgreSQL dump with instructions for loading it locally.
  • A data folder of CSV files containing the actual table contents.

Entity Overview

  • crates – Rust packages published to crates.io.
  • versions – Specific releases of a crate (e.g., serde v1.0.228).
  • categories – Taxonomic classifications (e.g., science::bioinformatics).
  • keywords – User‑defined tags for discoverability (e.g., cargo, sql).
  • teams – Organizational accounts (GitHub) that can own crates.
  • users – Individual developer accounts (GitHub).

Fact Tables

  • version_downloads – Time‑series download counts per version per day (last 3 months only).
  • crate_downloads – All‑time total download counts per crate.
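
For example, joining crate_downloads to crates yields an all‑time popularity ranking. A minimal DuckDB sketch, assuming the tables are loaded under the stg_ prefix used throughout the staging layer:

    -- Top 10 crates by all-time downloads.
    SELECT c.name,
           cd.downloads AS total_downloads
    FROM stg_crates c
    JOIN stg_crate_downloads cd ON cd.crate_id = c.id
    ORDER BY cd.downloads DESC
    LIMIT 10;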

Junction Tables

  • crates_categories – Links crates to categories.
  • crate_owners – Links crates to users or teams.
  • crates_keywords – Links crates to keywords.
  • dependencies – Links versions to the crates they depend on.
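
These junction tables make graph‑style questions cheap to ask. Counting reverse dependencies, for instance, means walking dependencies back through versions to the depending crate; a sketch, again assuming stg_-prefixed tables:

    -- Crates with the most distinct dependent crates.
    -- dependencies.version_id is the depending version;
    -- dependencies.crate_id is the crate being depended on.
    SELECT target.name,
           COUNT(DISTINCT v.crate_id) AS dependent_crates
    FROM stg_dependencies d
    JOIN stg_versions v    ON v.id = d.version_id
    JOIN stg_crates target ON target.id = d.crate_id
    GROUP BY target.name
    ORDER BY dependent_crates DESC
    LIMIT 10;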

Support Tables

  • metadata – Contains a single row with total_downloads.
  • reserved_crate_names – List of protected/unavailable crate names.
  • default_versions – Links crates to their default version.
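
default_versions saves a round of max-by-semver logic when you just want "the" version of a crate. A sketch, with stg_ table names assumed:

    -- Resolve each crate's default version via the default_versions link.
    SELECT c.name,
           v.num AS default_version
    FROM stg_crates c
    JOIN stg_default_versions dv ON dv.crate_id = c.id
    JOIN stg_versions v          ON v.id = dv.version_id
    LIMIT 10;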

Historical Data

  • The version_downloads table only holds the most recent three months of data.
  • Older daily CSV dumps (available on the crates.io site) contain only version_id and downloads columns; the date must be inferred from the filename (see the sketch after this list).
  • All other tables represent the state of the ecosystem at the moment of the dump (e.g., crates.created_at, crates.updated_at).
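
DuckDB makes the filename trick painless: read_csv can expose each file's path as a column, from which the date can be parsed. A sketch, with an illustrative archive layout:

    -- filename = true adds the source path as a column; the archive CSVs
    -- hold only version_id and downloads, so the date is parsed from it.
    SELECT regexp_extract(filename, '(\d{4}-\d{2}-\d{2})', 1)::DATE AS date,
           version_id,
           downloads
    FROM read_csv('archive/version_downloads/*.csv', filename = true);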

Key takeaways

  • The DB dump is refreshed daily, reflecting the ecosystem at download time.
  • Each day, one day’s worth of data is removed from the three‑month window and placed in the CSV archives.
  • The most important tables for analytics are crates, versions, version_downloads, and dependencies.

Architecture Overview

ELT Pipeline Design

  • raw – Direct CSV loads from the crates.io dumps.
  • staging – Cleaned and validated data (prefix stg_).
  • marts – Analytics‑ready tables (not covered in this post).

Extract

  • Download the crates.io DB dump (tar.gz) and extract the CSV files.

Load

  • Import all CSVs into the raw schema using a full refresh strategy—each new dump replaces the raw tables, giving a clean snapshot of the current state.
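
A minimal sketch of the full‑refresh load in DuckDB, assuming the dump's CSVs sit in a local data/ folder:

    -- Each new dump replaces the raw tables wholesale.
    CREATE SCHEMA IF NOT EXISTS raw;
    CREATE OR REPLACE TABLE raw.crates AS
        SELECT * FROM read_csv_auto('data/crates.csv');
    CREATE OR REPLACE TABLE raw.version_downloads AS
        SELECT * FROM read_csv_auto('data/version_downloads.csv');
    -- ...one statement per CSV in the dump's data folder.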

Transform

  • Apply data‑quality rules in the staging schema:
    • Normalize all timestamps to UTC.
    • Incrementally load stg_version_downloads (add new dates only): the first run ingests all available dates; subsequent runs add only new data (see the dbt sketch below).
    • Full refresh on dimension tables (categories, crates, versions, …) to capture updates.
    • Enforce data contracts and run quality tests.

This approach efficiently handles the three‑month rolling window while preserving a complete historical archive in stg_version_downloads.
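
As a concrete illustration of the incremental piece, here is a minimal dbt model sketch; the model and source names are assumptions, not the project's actual files:

    -- models/staging/stg_version_downloads.sql (names illustrative)
    {{ config(materialized='incremental', unique_key=['version_id', 'date']) }}

    SELECT version_id,
           downloads,
           date
    FROM {{ source('raw', 'version_downloads') }}
    {% if is_incremental() %}
    -- On incremental runs, only pull dates newer than what is already loaded.
    WHERE date > (SELECT MAX(date) FROM {{ this }})
    {% endif %}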

Backfilling Historical Downloads

  • Historical archives date back to 2014‑11‑11.
  • A backfill script (parameterized by start/end dates) ingests older CSVs directly into stg_version_downloads.
  • Snapshots of crates, dependencies, and versions are also taken to capture changes between dumps.
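
Those snapshots map naturally onto dbt's snapshot feature. A minimal sketch for crates, keyed on id and driven by the updated_at column noted earlier (all names illustrative):

    {% snapshot crates_snapshot %}
    {{ config(
        target_schema='snapshots',
        unique_key='id',
        strategy='timestamp',
        updated_at='updated_at'
    ) }}
    -- dbt records dbt_valid_from/dbt_valid_to for every changed row,
    -- preserving how crates evolve between dumps.
    SELECT * FROM {{ source('raw', 'crates') }}
    {% endsnapshot %}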

Constraints & Technology Stack

  • No infrastructure overhead – DuckDB, an embedded, single‑file OLAP database.
  • Cross‑platform – Python + uv (a fast package manager).
  • Fast iteration – dbt for SQL transformations, testing, and snapshots.
  • Storage efficiency – the DuckDB file is ~10 GB for data spanning 2014‑11‑11 to 2025‑11‑29; it runs on a laptop (~8 GB RAM, ~20 GB free disk).
  • Visualization – Streamlit for quick data‑validation dashboards.

All project dependencies are managed via uv; the only prerequisite is having uv installed.

Investigating Orphan Versions

One interesting data‑quality check involved orphan versions in the stg_version_downloads table—entries where the version_id does not exist in the versions table.

The investigation steps:

  1. Identify orphan rows

    SELECT vd.version_id
    FROM stg_version_downloads vd
    LEFT JOIN stg_versions v ON vd.version_id = v.id
    WHERE v.id IS NULL
    LIMIT 100;
  2. Quantify the issue

    SELECT COUNT(*) AS orphan_count
    FROM stg_version_downloads vd
    LEFT JOIN stg_versions v ON vd.version_id = v.id
    WHERE v.id IS NULL;
  3. Root cause analysis

    • Orphans often arise when a crate or version is yanked or deleted after its downloads were recorded but before the dump was taken.
    • The three‑month window may contain download records for versions that no longer exist in the current versions table.
  4. Resolution strategy

    • Keep orphan rows for historical completeness, but flag them in downstream analyses.
    • Optionally, maintain a separate “historical versions” table that stores metadata for yanked/deleted versions.

This check became part of the automated dbt tests, ensuring future loads surface any new orphan rows automatically.
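
Such a check could be written as a dbt singular test along these lines (file name illustrative); dbt fails a test whenever its query returns rows, and since orphans are kept deliberately, severity='warn' reports them without breaking the build:

    -- tests/orphan_version_downloads.sql
    {{ config(severity='warn') }}

    SELECT vd.version_id
    FROM {{ ref('stg_version_downloads') }} vd
    LEFT JOIN {{ ref('stg_versions') }} v ON vd.version_id = v.id
    WHERE v.id IS NULL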
