Building Indx.sh - Automating Content Discovery: How We Crawl GitHub for AI Resources

Published: February 2, 2026 at 01:20 PM EST
4 min read
Source: Dev.to

TL;DR
I built automated crawlers that discover AI coding prompts, skills, and MCP servers from GitHub, running daily via Vercel cron jobs. Here’s how.

The Problem

When I launched indx.sh, the AI coding ecosystem was moving too fast to track manually:

  • New MCP servers appear daily
  • Developers constantly publish cursor rules and skill definitions
  • Official repositories receive frequent updates
  • Star counts change

Manually keeping up was impossible.

Solution Overview

I created three automated crawlers that run every night:

Crawler          | What it finds
Prompts Crawler  | .cursorrules, CLAUDE.md, copilot-instructions.md files
Skills Crawler   | Repositories containing SKILL.md files
MCP Crawler      | Model Context Protocol (MCP) servers (no single file convention)

All crawlers are executed as Vercel cron jobs, keeping the index fresh without any manual intervention.

Prompts Crawler

// Files we look for
const FILE_SEARCHES = [
  { query: 'filename:.cursorrules', tool: 'cursor' },
  { query: 'filename:CLAUDE.md', tool: 'claude-code' },
  { query: 'filename:copilot-instructions.md', tool: 'copilot' },
];

// Repository‑level queries
const REPO_SEARCHES = [
  'cursor-rules in:name,description',
  'awesome-cursorrules',
  'topic:cursor-rules',
];
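
The file queries hit GitHub's code-search endpoint, while the repository queries go to the repository-search endpoint sorted by stars. A minimal sketch of the latter using plain fetch (the real crawler may wrap this differently; GITHUB_TOKEN is assumed to be configured):

const query = encodeURIComponent('topic:cursor-rules');
const res = await fetch(
  `https://api.github.com/search/repositories?q=${query}&sort=stars&order=desc&per_page=50`,
  {
    headers: {
      Accept: 'application/vnd.github+json',
      Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
    },
  }
);
// Each item has full_name, description, stargazers_count, topics, owner, ...
const { items: repos } = await res.json();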

Processing steps for each file found (sketched below)

  1. Fetch the raw content from GitHub.
  2. Generate a slug in the form owner-repo-filename.
  3. Infer category and tags from the file content.
  4. Auto‑verify repos that have 100+ stars.
  5. Upsert the record into the database.
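
A condensed sketch of those five steps (helper and model names such as inferCategoryAndTags and prisma.prompt are illustrative, not the exact production code):

async function processPromptFile(owner, repo, path, tool, stars) {
  // 1. Fetch the raw file content from GitHub
  const content = await fetchFileContent(owner, repo, path);

  // 2. Build the owner-repo-filename slug
  const filename = path.split('/').pop();
  const slug = `${owner}-${repo}-${filename}`.toLowerCase();

  // 3. Infer category and tags from the content (keyword heuristics)
  const { category, tags } = inferCategoryAndTags(content);

  // 4. Auto-verify repos with 100+ stars
  const verified = stars >= 100;

  // 5. Upsert so repeat runs update the record instead of duplicating it
  await prisma.prompt.upsert({
    where: { slug },
    create: { slug, tool, content, category, tags, githubStars: stars, verified },
    update: { content, githubStars: stars },
  });
}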

The first run indexed 175 prompts across Cursor, Claude Code, and Copilot.

Skills Crawler

// Search GitHub for SKILL.md files
const { items } = await searchGitHub('filename:SKILL.md');

for (const item of items) {
  // Each code-search result includes its repository (owner/name)
  const [owner, repo] = item.repository.full_name.split('/');
  const slug = `${owner}-${repo}-${item.path}`.replaceAll('/', '-').toLowerCase();

  // Fetch the actual SKILL.md content
  const content = await fetchFileContent(owner, repo, item.path);

  // Parse frontmatter (name, description, tags)
  const metadata = parseFrontmatter(content);

  // Code search doesn't return star counts, so look them up separately (helper not shown)
  const githubStars = await fetchRepoStars(owner, repo);

  // Upsert to database
  await prisma.skill.upsert({
    where: { slug },
    create: { ...metadata, slug, content, githubStars },
    update: { githubStars }, // Keep stars fresh
  });
}

Key insight

GitHub’s code-search API lets you query by filename, so filename:SKILL.md returns every indexed file with that name, along with the repository it lives in.
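
For reference, a minimal version of that call against the REST code-search endpoint (the searchGitHub helper above presumably wraps something like this; a GITHUB_TOKEN is assumed, since code search requires authentication):

const res = await fetch(
  'https://api.github.com/search/code?q=' + encodeURIComponent('filename:SKILL.md') + '&per_page=50',
  {
    headers: {
      Accept: 'application/vnd.github+json',
      Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
    },
  }
);
// Each item: { path, repository: { full_name, owner, ... }, ... }
const { items } = await res.json();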

MCP Crawler

MCP servers lack a single‑file convention, so I employ multiple search strategies:

const SEARCH_STRATEGIES = [
  'mcp server in:name,description',
  'model context protocol server',
  'topic:mcp',
  '@modelcontextprotocol/server',
  'mcp server typescript',
  'mcp server python',
];

For each strategy (steps 2 and 5 are sketched below):

  1. Search GitHub repositories sorted by stars.
  2. Filter results for MCP‑related content.
  3. Fetch package.json (when applicable) to obtain npm package names.
  4. Infer categories from repository description and topics.
  5. Mark official repos (e.g., those under the modelcontextprotocol org) as verified.
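
A rough sketch of steps 2 and 5, i.e. the relevance filter and the official-org check (model and field names like prisma.mcpServer are illustrative):

const OFFICIAL_ORGS = new Set(['modelcontextprotocol', 'anthropics']);

// Cheap relevance filter: does the repo actually mention MCP?
function looksLikeMcpServer(repo) {
  const haystack = `${repo.name} ${repo.description ?? ''} ${(repo.topics ?? []).join(' ')}`.toLowerCase();
  return haystack.includes('mcp') || haystack.includes('model context protocol');
}

// searchResults: repositories returned by one of the SEARCH_STRATEGIES queries
for (const repo of searchResults) {
  if (!looksLikeMcpServer(repo)) continue;

  const owner = repo.owner.login.toLowerCase();
  const slug = `${owner}-${repo.name}`.toLowerCase();

  await prisma.mcpServer.upsert({
    where: { slug },
    create: {
      slug,
      name: repo.name,
      description: repo.description ?? '',
      githubStars: repo.stargazers_count,
      verified: OFFICIAL_ORGS.has(owner), // official orgs are auto-verified
    },
    update: { githubStars: repo.stargazers_count },
  });
}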

Cron Schedule (Vercel)

{
  "crons": [
    { "path": "/api/cron/sync-github-stats", "schedule": "0 3 * * *" },
    { "path": "/api/cron/crawl-skills",      "schedule": "0 4 * * *" },
    { "path": "/api/cron/crawl-mcp",        "schedule": "0 5 * * *" },
    { "path": "/api/cron/crawl-prompts",    "schedule": "0 6 * * *" }
  ]
}

Every night (UTC)

  • 03:00 – Sync GitHub star counts for existing resources
  • 04:00 – Discover new skills
  • 05:00 – Discover new MCP servers
  • 06:00 – Discover new prompts/rules
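
Each of those paths is an ordinary API route that Vercel calls on schedule. A minimal handler sketch, assuming a Next.js App Router layout and a hypothetical crawlSkills helper; checking CRON_SECRET follows Vercel's documented pattern for protecting cron endpoints:

// app/api/cron/crawl-skills/route.js (sketch)
export async function GET(request) {
  // Vercel sends `Authorization: Bearer <CRON_SECRET>` when CRON_SECRET is configured
  const auth = request.headers.get('authorization');
  if (auth !== `Bearer ${process.env.CRON_SECRET}`) {
    return new Response('Unauthorized', { status: 401 });
  }

  const result = await crawlSkills({ limit: 50 }); // one batch per nightly run
  return Response.json({ ok: true, ...result });
}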

Handling GitHub API Limits

  • Unauthenticated: only 10 search requests/minute
  • Authenticated (with a token): 5,000 requests/hour on the core API (search endpoints have their own, lower per-minute limit)

Strategies

if (res.status === 403) {
  // X-RateLimit-Reset is a Unix timestamp (in seconds) for when the window resets
  const resetTime = Number(res.headers.get('X-RateLimit-Reset'));
  console.log(`Rate limited. Resets at ${new Date(resetTime * 1000)}`);
  await sleep(60000); // Wait a minute, then retry
}

  • Small delays between requests
  • Process items in batches (≈ 50 per cron run)
  • Graceful retry on rate‑limit errors
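
The delay-and-batch part looks roughly like this (a hypothetical helper; the 50-item cap matches the batch size mentioned above):

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function crawlBatch(items, handler) {
  // Only a slice of the backlog per run keeps us under rate limits (and Vercel's timeout)
  for (const item of items.slice(0, 50)) {
    await handler(item);
    await sleep(1000); // small delay between GitHub requests
  }
}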

Lessons Learned

  1. Incremental over bulk – Early attempts to crawl everything at once caused timeouts and chaos. Processing ~50 items per run is stable.
  2. Deduplication by slug – The same repo can appear in multiple search strategies; using a consistent owner-repo-path slug and upserting avoids duplicates (see the sketch after this list).
  3. Don’t trust descriptions – Many repos have empty or misleading descriptions. Fallback: "AI rules from {owner}/{repo}".
  4. Official = trusted – Repos from modelcontextprotocol, anthropics, or anthropic-ai orgs receive auto‑verified badges. Community repos require manual verification.
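
For lessons 2 and 3, the helpers are small; something along these lines (illustrative, not the exact production code):

// Lesson 2: one stable slug per resource, used as the upsert key everywhere
function makeSlug(owner, repo, path = '') {
  return [owner, repo, path.replaceAll('/', '-')]
    .filter(Boolean)
    .join('-')
    .toLowerCase();
}

// Lesson 3: fall back to a generated description when the repo's is empty
function describeRepo(repo) {
  return repo.description?.trim() || `AI rules from ${repo.full_name}`;
}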

Results (after a few weeks)

  • 790+ MCP servers indexed
  • 1,300+ skills discovered
  • 300+ prompts/rules indexed
  • Daily updates keep star counts fresh

Note: GitHub search isn’t perfect; false positives (e.g., repos mentioning “mcp” but not actually providing a server) still require manual review. The 50‑item limit per cron run also means full indexing can take several days, especially on Vercel’s hobby plan with a 10‑second timeout.

Future Improvements

  • Better category inference using AI
  • Richer README parsing for detailed descriptions
  • Automatic quality scoring based on stars, activity, and documentation
  • User submissions to fill gaps

Browse the Index

Explore the auto‑discovered resources at indx.sh:

  • Rules & Prompts – Cursor, Claude Code, Copilot rules
  • MCP Servers – Sorted by GitHub stars
  • Skills – Searchable by name and tags

If a resource is missing, you can submit it manually or wait for the crawlers to pick it up.

This post is part 2 of the “Building indx.sh” series.
