Building Indx.sh - Automating Content Discovery: How We Crawl GitHub for AI Resources
Source: Dev.to
TL;DR
I built automated crawlers that discover AI coding prompts, skills, and MCP servers from GitHub, running daily via Vercel cron jobs. Here’s how.
The Problem
When I launched indx.sh, the AI coding ecosystem was moving too fast to track manually:
- New MCP servers appear daily
- Developers constantly publish cursor rules and skill definitions
- Official repositories receive frequent updates
- Star counts change
Manually keeping up was impossible.
Solution Overview
I created three automated crawlers that run every night:
| Crawler | What it finds |
|---|---|
| Prompts Crawler | .cursorrules, CLAUDE.md, copilot-instructions.md files |
| Skills Crawler | Repositories containing SKILL.md files |
| MCP Crawler | Model Context Protocol (MCP) servers (no single file convention) |
All crawlers are executed as Vercel cron jobs, keeping the index fresh without any manual intervention.
Prompts Crawler
```javascript
// Files we look for
const FILE_SEARCHES = [
  { query: 'filename:.cursorrules', tool: 'cursor' },
  { query: 'filename:CLAUDE.md', tool: 'claude-code' },
  { query: 'filename:copilot-instructions.md', tool: 'copilot' },
];

// Repository-level queries
const REPO_SEARCHES = [
  'cursor-rules in:name,description',
  'awesome-cursorrules',
  'topic:cursor-rules',
];
```
Processing steps for each file found
- Fetch the raw content from GitHub.
- Generate a slug in the form `owner-repo-filename`.
- Infer category and tags from the file content.
- Auto-verify repos that have 100+ stars.
- Upsert the record into the database (a condensed sketch follows this list).
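A rough sketch of that per-file pipeline, assuming a code-search result `item` and a `prisma.prompt` model; the helpers `fetchFileContent`, `inferCategoryAndTags`, and `fetchRepoStars` are illustrative names, not the exact implementation:

```javascript
// Hypothetical shape of the per-file pipeline described above
async function processPromptFile(item, tool) {
  const [owner, repo] = item.repository.full_name.split('/');

  // 1. Fetch the raw content from GitHub
  const content = await fetchFileContent(owner, repo, item.path);

  // 2. Build a stable slug: owner-repo-filename
  const filename = item.path.split('/').pop();
  const slug = `${owner}-${repo}-${filename}`.toLowerCase();

  // 3. Infer category and tags from the content (illustrative helper)
  const { category, tags } = inferCategoryAndTags(content);

  // 4. Auto-verify popular repos
  const githubStars = await fetchRepoStars(owner, repo); // illustrative helper
  const verified = githubStars >= 100;

  // 5. Upsert so re-runs update existing rows instead of duplicating them
  await prisma.prompt.upsert({
    where: { slug },
    create: { slug, tool, content, category, tags, githubStars, verified },
    update: { githubStars, verified },
  });
}
```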
The first run indexed 175 prompts across Cursor, Claude Code, and Copilot.
Skills Crawler
```javascript
// Search GitHub for SKILL.md files
const { items } = await searchGitHub('filename:SKILL.md');

for (const item of items) {
  const [owner, repo] = item.repository.full_name.split('/');

  // Fetch the actual SKILL.md content
  const content = await fetchFileContent(owner, repo, item.path);

  // Parse frontmatter (name, description, tags)
  const metadata = parseFrontmatter(content);

  // Stable slug so re-crawls update the same row
  const slug = `${owner}-${repo}-${item.path.replace(/\//g, '-')}`.toLowerCase();
  const githubStars = await fetchRepoStars(owner, repo); // star count needs a separate repo lookup (illustrative helper)

  // Upsert to database
  await prisma.skill.upsert({
    where: { slug },
    create: { slug, ...metadata, content, githubStars },
    update: { githubStars }, // Keep stars fresh
  });
}
```
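`parseFrontmatter` isn't shown above; a minimal version could lean on the `gray-matter` package (my assumption, not necessarily what indx.sh uses):

```javascript
import matter from 'gray-matter';

// Extract YAML frontmatter (name, description, tags) from a SKILL.md file
function parseFrontmatter(markdown) {
  const { data } = matter(markdown);
  return {
    name: data.name ?? '',
    description: data.description ?? '',
    tags: Array.isArray(data.tags) ? data.tags : [],
  };
}
```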
Key insight
GitHub’s code‑search API lets you query by filename, so `filename:SKILL.md` returns every repository that contains such a file.
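A sketch of what `searchGitHub` might look like against GitHub's REST code-search endpoint; the function itself is mine, while the endpoint, headers, and the fact that code search requires an authenticated token are GitHub's documented behavior:

```javascript
// Minimal code-search call: returns { total_count, items }, where each item
// carries a path and a repository object
async function searchGitHub(query, page = 1) {
  const url =
    `https://api.github.com/search/code` +
    `?q=${encodeURIComponent(query)}&per_page=50&page=${page}`;

  const res = await fetch(url, {
    headers: {
      Accept: 'application/vnd.github+json',
      Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
    },
  });

  if (!res.ok) throw new Error(`GitHub search failed: ${res.status}`);
  return res.json();
}
```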
MCP Crawler
MCP servers lack a single‑file convention, so I employ multiple search strategies:
```javascript
const SEARCH_STRATEGIES = [
  'mcp server in:name,description',
  'model context protocol server',
  'topic:mcp',
  '@modelcontextprotocol/server',
  'mcp server typescript',
  'mcp server python',
];
```
For each strategy (one pass is sketched after the list):
- Search GitHub repositories sorted by stars.
- Filter results for MCP‑related content.
- Fetch `package.json` (when applicable) to obtain npm package names.
- Infer categories from repository description and topics.
- Mark official repos (e.g., those under the `modelcontextprotocol` org) as verified.
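Roughly, one pass over a strategy could look like this; the `looksLikeMcpServer` filter and `upsertMcpServer` helper are illustrative placeholders, while the repository-search endpoint and its `sort=stars&order=desc` parameters are GitHub's documented API:

```javascript
async function crawlStrategy(strategy) {
  // Repository search, highest-starred results first
  const url =
    `https://api.github.com/search/repositories` +
    `?q=${encodeURIComponent(strategy)}&sort=stars&order=desc&per_page=50`;

  const res = await fetch(url, {
    headers: {
      Accept: 'application/vnd.github+json',
      Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
    },
  });
  const { items } = await res.json();

  for (const repo of items) {
    // Keep only results that actually look like MCP servers (illustrative filter)
    if (!looksLikeMcpServer(repo)) continue;

    // Official org repos get the auto-verified badge
    const verified = repo.owner.login === 'modelcontextprotocol';

    await upsertMcpServer(repo, { verified }); // illustrative helper
  }
}
```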
Cron Schedule (Vercel)
```json
{
  "crons": [
    { "path": "/api/cron/sync-github-stats", "schedule": "0 3 * * *" },
    { "path": "/api/cron/crawl-skills", "schedule": "0 4 * * *" },
    { "path": "/api/cron/crawl-mcp", "schedule": "0 5 * * *" },
    { "path": "/api/cron/crawl-prompts", "schedule": "0 6 * * *" }
  ]
}
```
Every night (UTC)
- 03:00 – Sync GitHub star counts for existing resources
- 04:00 – Discover new skills
- 05:00 – Discover new MCP servers
- 06:00 – Discover new prompts/rules
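Each path maps to a route handler that Vercel's cron scheduler invokes. A minimal Next.js App Router sketch, using Vercel's documented `CRON_SECRET` authorization check; the `crawlSkills` call is a placeholder for the actual crawler:

```javascript
// app/api/cron/crawl-skills/route.js
export async function GET(request) {
  // Vercel sends "Authorization: Bearer <CRON_SECRET>" with each cron invocation
  const authHeader = request.headers.get('authorization');
  if (authHeader !== `Bearer ${process.env.CRON_SECRET}`) {
    return new Response('Unauthorized', { status: 401 });
  }

  const result = await crawlSkills(); // placeholder for the actual crawler
  return Response.json({ ok: true, ...result });
}
```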
Handling GitHub API Limits
- Unauthenticated: 10 search requests/minute
- Authenticated (with a token): 5,000 REST requests/hour (the search endpoints keep their own, much lower per‑minute limit)
Strategies
```javascript
if (res.status === 403) {
  const resetTime = Number(res.headers.get('X-RateLimit-Reset'));
  console.log(`Rate limited. Resets at ${new Date(resetTime * 1000)}`);
  await sleep(60000); // Wait and retry
}
```
- Small delays between requests
- Process items in batches (≈ 50 per cron run)
- Graceful retry on rate‑limit errors
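The delay-and-batch pattern is simple; a sketch that also defines the `sleep` helper used above (the `processBatch` wrapper and its defaults are mine, the ~50-item batch size is the value mentioned above):

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Process at most ~50 items per cron run, with a short pause between requests
async function processBatch(items, handler, { batchSize = 50, delayMs = 1000 } = {}) {
  for (const item of items.slice(0, batchSize)) {
    await handler(item);
    await sleep(delayMs); // small delay between requests to stay under rate limits
  }
}
```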
Lessons Learned
- Incremental over bulk – Early attempts to crawl everything at once caused timeouts and chaos. Processing ~50 items per run is stable.
- Deduplication by slug – The same repo can appear in multiple search strategies; using a consistent `owner-repo-path` slug and upserting avoids duplicates.
- Don’t trust descriptions – Many repos have empty or misleading descriptions. Fallback: "AI rules from {owner}/{repo}".
- Official = trusted – Repos from the `modelcontextprotocol`, `anthropics`, or `anthropic-ai` orgs receive auto‑verified badges. Community repos require manual verification.
Results (after a few weeks)
- 790+ MCP servers indexed
- 1,300+ skills discovered
- 300+ prompts/rules indexed
- Daily updates keep star counts fresh
Note: GitHub search isn’t perfect; false positives (e.g., repos mentioning “mcp” but not actually providing a server) still require manual review. The 50‑item limit per cron run also means full indexing can take several days, especially on Vercel’s hobby plan with a 10‑second timeout.
Future Improvements
- Better category inference using AI
- Richer README parsing for detailed descriptions
- Automatic quality scoring based on stars, activity, and documentation
- User submissions to fill gaps
Browse the Index
Explore the auto‑discovered resources at indx.sh:
- Rules & Prompts – Cursor, Claude Code, Copilot rules
- MCP Servers – Sorted by GitHub stars
- Skills – Searchable by name and tags
If a resource is missing, you can submit it manually or wait for the crawlers to pick it up.
This post is part 2 of the “Building indx.sh” series.