Building Indx.sh - Automating Content Discovery: How We Crawl GitHub for AI Resources

Published: February 2, 2026 at 01:20 PM EST
4 min read
Source: Dev.to

TL;DR
I built automated crawlers that discover AI coding prompts, skills, and MCP servers from GitHub, running daily via Vercel cron jobs. Here’s how.

The Problem

When I launched indx.sh, the AI coding ecosystem was moving too fast to track manually:

  • New MCP servers appear daily
  • Developers constantly publish cursor rules and skill definitions
  • Official repositories receive frequent updates
  • Star counts change

Manually keeping up was impossible.

Solution Overview

I created three automated crawlers that run every night:

Crawler          | What it finds
Prompts Crawler  | .cursorrules, CLAUDE.md, copilot-instructions.md files
Skills Crawler   | Repositories containing SKILL.md files
MCP Crawler      | Model Context Protocol (MCP) servers (no single file convention)

All crawlers are executed as Vercel cron jobs, keeping the index fresh without any manual intervention.

Prompts Crawler

// Files we look for
const FILE_SEARCHES = [
  { query: 'filename:.cursorrules', tool: 'cursor' },
  { query: 'filename:CLAUDE.md', tool: 'claude-code' },
  { query: 'filename:copilot-instructions.md', tool: 'copilot' },
];

// Repository‑level queries
const REPO_SEARCHES = [
  'cursor-rules in:name,description',
  'awesome-cursorrules',
  'topic:cursor-rules',
];
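
The file queries hit GitHub's code-search endpoint, while the repository queries go to the repository-search endpoint sorted by stars. A minimal sketch of the latter using plain fetch (the real crawler may wrap this differently; GITHUB_TOKEN is assumed to be configured):

const query = encodeURIComponent('topic:cursor-rules');
const res = await fetch(
  `https://api.github.com/search/repositories?q=${query}&sort=stars&order=desc&per_page=50`,
  {
    headers: {
      Accept: 'application/vnd.github+json',
      Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
    },
  }
);
// Each item has full_name, description, stargazers_count, topics, owner, ...
const { items: repos } = await res.json();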

Processing steps for each file found (sketched below)

  1. Fetch the raw content from GitHub.
  2. Generate a slug in the form owner-repo-filename.
  3. Infer category and tags from the file content.
  4. Auto‑verify repos that have 100+ stars.
  5. Upsert the record into the database.
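
A condensed sketch of those five steps (helper and model names such as inferCategoryAndTags and prisma.prompt are illustrative, not the exact production code):

async function processPromptFile(owner, repo, path, tool, stars) {
  // 1. Fetch the raw file content from GitHub
  const content = await fetchFileContent(owner, repo, path);

  // 2. Build the owner-repo-filename slug
  const filename = path.split('/').pop();
  const slug = `${owner}-${repo}-${filename}`.toLowerCase();

  // 3. Infer category and tags from the content (keyword heuristics)
  const { category, tags } = inferCategoryAndTags(content);

  // 4. Auto-verify repos with 100+ stars
  const verified = stars >= 100;

  // 5. Upsert so repeat runs update the record instead of duplicating it
  await prisma.prompt.upsert({
    where: { slug },
    create: { slug, tool, content, category, tags, githubStars: stars, verified },
    update: { content, githubStars: stars },
  });
}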

The first run indexed 175 prompts across Cursor, Claude Code, and Copilot.

Skills Crawler

// Search GitHub for SKILL.md files
const { items } = await searchGitHub('filename:SKILL.md');

for (const item of items) {
  // Each code-search result includes its repository (owner/name)
  const [owner, repo] = item.repository.full_name.split('/');
  const slug = `${owner}-${repo}-${item.path}`.replaceAll('/', '-').toLowerCase();

  // Fetch the actual SKILL.md content
  const content = await fetchFileContent(owner, repo, item.path);

  // Parse frontmatter (name, description, tags)
  const metadata = parseFrontmatter(content);

  // Code search doesn't return star counts, so look them up separately (helper not shown)
  const githubStars = await fetchRepoStars(owner, repo);

  // Upsert to database
  await prisma.skill.upsert({
    where: { slug },
    create: { ...metadata, slug, content, githubStars },
    update: { githubStars }, // Keep stars fresh
  });
}

Key insight

GitHub’s code-search API lets you query by filename, so filename:SKILL.md returns every indexed file with that name, along with the repository it lives in.
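
For reference, a minimal version of that call against the REST code-search endpoint (the searchGitHub helper above presumably wraps something like this; a GITHUB_TOKEN is assumed, since code search requires authentication):

const res = await fetch(
  'https://api.github.com/search/code?q=' + encodeURIComponent('filename:SKILL.md') + '&per_page=50',
  {
    headers: {
      Accept: 'application/vnd.github+json',
      Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
    },
  }
);
// Each item: { path, repository: { full_name, owner, ... }, ... }
const { items } = await res.json();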

MCP Crawler

MCP servers lack a single‑file convention, so I employ multiple search strategies:

const SEARCH_STRATEGIES = [
  'mcp server in:name,description',
  'model context protocol server',
  'topic:mcp',
  '@modelcontextprotocol/server',
  'mcp server typescript',
  'mcp server python',
];

For each strategy (steps 2 and 5 are sketched below):

  1. Search GitHub repositories sorted by stars.
  2. Filter results for MCP‑related content.
  3. Fetch package.json (when applicable) to obtain npm package names.
  4. Infer categories from repository description and topics.
  5. Mark official repos (e.g., those under the modelcontextprotocol org) as verified.
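
A rough sketch of steps 2 and 5, i.e. the relevance filter and the official-org check (model and field names like prisma.mcpServer are illustrative):

const OFFICIAL_ORGS = new Set(['modelcontextprotocol', 'anthropics']);

// Cheap relevance filter: does the repo actually mention MCP?
function looksLikeMcpServer(repo) {
  const haystack = `${repo.name} ${repo.description ?? ''} ${(repo.topics ?? []).join(' ')}`.toLowerCase();
  return haystack.includes('mcp') || haystack.includes('model context protocol');
}

// searchResults: repositories returned by one of the SEARCH_STRATEGIES queries
for (const repo of searchResults) {
  if (!looksLikeMcpServer(repo)) continue;

  const owner = repo.owner.login.toLowerCase();
  const slug = `${owner}-${repo.name}`.toLowerCase();

  await prisma.mcpServer.upsert({
    where: { slug },
    create: {
      slug,
      name: repo.name,
      description: repo.description ?? '',
      githubStars: repo.stargazers_count,
      verified: OFFICIAL_ORGS.has(owner), // official orgs are auto-verified
    },
    update: { githubStars: repo.stargazers_count },
  });
}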

Cron Schedule (Vercel)

{
  "crons": [
    { "path": "/api/cron/sync-github-stats", "schedule": "0 3 * * *" },
    { "path": "/api/cron/crawl-skills",      "schedule": "0 4 * * *" },
    { "path": "/api/cron/crawl-mcp",        "schedule": "0 5 * * *" },
    { "path": "/api/cron/crawl-prompts",    "schedule": "0 6 * * *" }
  ]
}

Every night (UTC)

  • 03:00 – Sync GitHub star counts for existing resources
  • 04:00 – Discover new skills
  • 05:00 – Discover new MCP servers
  • 06:00 – Discover new prompts/rules
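
Each of those paths is an ordinary API route that Vercel calls on schedule. A minimal handler sketch, assuming a Next.js App Router layout and a hypothetical crawlSkills helper; checking CRON_SECRET follows Vercel's documented pattern for protecting cron endpoints:

// app/api/cron/crawl-skills/route.js (sketch)
export async function GET(request) {
  // Vercel sends `Authorization: Bearer <CRON_SECRET>` when CRON_SECRET is configured
  const auth = request.headers.get('authorization');
  if (auth !== `Bearer ${process.env.CRON_SECRET}`) {
    return new Response('Unauthorized', { status: 401 });
  }

  const result = await crawlSkills({ limit: 50 }); // one batch per nightly run
  return Response.json({ ok: true, ...result });
}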

Handling GitHub API Limits

  • Unauthenticated: only 10 search requests/minute
  • Authenticated (with a token): 5,000 requests/hour on the core API (search endpoints have their own, lower per-minute limit)

Strategies

if (res.status === 403) {
  // X-RateLimit-Reset is a Unix timestamp (in seconds) for when the window resets
  const resetTime = Number(res.headers.get('X-RateLimit-Reset'));
  console.log(`Rate limited. Resets at ${new Date(resetTime * 1000)}`);
  await sleep(60000); // Wait a minute, then retry
}

  • Small delays between requests
  • Process items in batches (≈ 50 per cron run)
  • Graceful retry on rate‑limit errors
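
The delay-and-batch part looks roughly like this (a hypothetical helper; the 50-item cap matches the batch size mentioned above):

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function crawlBatch(items, handler) {
  // Only a slice of the backlog per run keeps us under rate limits (and Vercel's timeout)
  for (const item of items.slice(0, 50)) {
    await handler(item);
    await sleep(1000); // small delay between GitHub requests
  }
}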

Lessons Learned

  1. Incremental over bulk – Early attempts to crawl everything at once caused timeouts and chaos. Processing ~50 items per run is stable.
  2. Deduplication by slug – The same repo can appear in multiple search strategies; using a consistent owner-repo-path slug and upserting avoids duplicates (see the sketch after this list).
  3. Don’t trust descriptions – Many repos have empty or misleading descriptions. Fallback: "AI rules from {owner}/{repo}".
  4. Official = trusted – Repos from modelcontextprotocol, anthropics, or anthropic-ai orgs receive auto‑verified badges. Community repos require manual verification.
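
For lessons 2 and 3, the helpers are small; something along these lines (illustrative, not the exact production code):

// Lesson 2: one stable slug per resource, used as the upsert key everywhere
function makeSlug(owner, repo, path = '') {
  return [owner, repo, path.replaceAll('/', '-')]
    .filter(Boolean)
    .join('-')
    .toLowerCase();
}

// Lesson 3: fall back to a generated description when the repo's is empty
function describeRepo(repo) {
  return repo.description?.trim() || `AI rules from ${repo.full_name}`;
}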

Results (after a few weeks)

  • 790+ MCP servers indexed
  • 1,300+ skills discovered
  • 300+ prompts/rules indexed
  • Daily updates keep star counts fresh

Note: GitHub search isn’t perfect; false positives (e.g., repos mentioning “mcp” but not actually providing a server) still require manual review. The 50‑item limit per cron run also means full indexing can take several days, especially on Vercel’s hobby plan with a 10‑second timeout.

Future Improvements

  • Better category inference using AI
  • Richer README parsing for detailed descriptions
  • Automatic quality scoring based on stars, activity, and documentation
  • User submissions to fill gaps

Browse the Index

Explore the auto‑discovered resources at indx.sh:

  • Rules & Prompts – Cursor, Claude Code, Copilot rules
  • MCP Servers – Sorted by GitHub stars
  • Skills – Searchable by name and tags

If a resource is missing, you can submit it manually or wait for the crawlers to pick it up.

This post is part 2 of the “Building indx.sh” series.
