Building a Roboflow Universe Search Agent: Automating ML Model Discovery

Published: January 5, 2026 at 11:43 AM EST
4 min read
Source: Dev.to

Problem

As a machine learning enthusiast, I often find myself browsing Roboflow Universe looking for pre‑trained models.
Manually searching, clicking through pages, and copying API endpoints is tedious. I wanted a way to:

  • Search for models by keywords
  • Extract detailed information (metrics, classes, API endpoints)
  • Get structured data I could use programmatically

So I built a Python web scraper that does exactly that! 🚀

What It Does

The Roboflow Universe Search Agent is a Python tool that:

  • ✅ Searches Roboflow Universe with custom keywords
  • ✅ Extracts model details (title, author, metrics, classes)
  • ✅ Finds API endpoints using multiple extraction strategies
  • ✅ Outputs structured JSON data
  • ✅ Handles retries and errors gracefully

The Challenge: Finding API Endpoints

The trickiest part was reliably extracting API endpoints. Roboflow displays them in various places:

  • JavaScript code snippets
  • Model ID variables
  • Input fields
  • Page text
  • Legacy endpoint formats

I needed a robust solution that wouldn’t break if the website structure changed.

The Solution: Multi‑Strategy Extraction

Instead of relying on a single method, I implemented six different extraction strategies with fallbacks.

Strategy 1: JavaScript Code Blocks

The most reliable source – API endpoints appear in code snippets:

js_patterns = [
    r'url:\s*["\']https://serverless\.roboflow\.com/([^"\'?\s]+)["\']',
    r'"https://serverless\.roboflow\.com/([^"\'?\s]+)"',
    r'https://serverless\.roboflow\.com/([a-z0-9\-_]+/\d+)',
]
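To illustrate how these patterns behave in practice, here is a minimal sketch that tries each one in order against a snippet resembling the JavaScript Roboflow renders in its code examples. The sample HTML and the `extract_model_id` helper are assumptions for illustration, not the tool's actual code.

```python
import re

js_patterns = [
    r'url:\s*["\']https://serverless\.roboflow\.com/([^"\'?\s]+)["\']',
    r'"https://serverless\.roboflow\.com/([^"\'?\s]+)"',
    r'https://serverless\.roboflow\.com/([a-z0-9\-_]+/\d+)',
]

# Hypothetical snippet resembling what a model page embeds
html = '''
fetch({
    url: "https://serverless.roboflow.com/basketball-detection/1",
    method: "POST",
})
'''

def extract_model_id(text):
    # Try each pattern in order; the first match wins
    for pattern in js_patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None

print(extract_model_id(html))  # basketball-detection/1
```

Ordering the patterns from most to least specific means the strictest match is preferred when several would fire.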

Strategy 2: Model ID Patterns

Extract from JavaScript variables:

model_id_patterns = [
    r'model_id["\']?\s*[:=]\s*["\']([a-z0-9\-_]+/\d+)["\']',
    r'MODEL_ENDPOINT["\']?\s*[:=]\s*["\']([a-z0-9\-_]+/\d+)["\']',
]
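These patterns target assignments in inline scripts. A quick sketch against a hypothetical variable declaration (the `script` string is an assumed example, not taken from Roboflow's actual markup):

```python
import re

model_id_patterns = [
    r'model_id["\']?\s*[:=]\s*["\']([a-z0-9\-_]+/\d+)["\']',
    r'MODEL_ENDPOINT["\']?\s*[:=]\s*["\']([a-z0-9\-_]+/\d+)["\']',
]

# Hypothetical inline script found in the page source
script = 'var model_id = "basketball-detection/1";'

match = None
for pattern in model_id_patterns:
    match = re.search(pattern, script)
    if match:
        break

print(match.group(1))  # basketball-detection/1
```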

Strategy 3: Input Fields & Textareas

Check form elements and code blocks:

input_selectors = [
    "input[value*='serverless.roboflow.com']",
    "textarea",
    "code",
]
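Walking these selectors might look like the sketch below. It assumes a Playwright-style `page` object exposing `query_selector_all()`; the `FakePage` stand-in exists only so the example runs without a browser, and the helper names are illustrative.

```python
import re

input_selectors = [
    "input[value*='serverless.roboflow.com']",
    "textarea",
    "code",
]

ENDPOINT_RE = re.compile(r'serverless\.roboflow\.com/([a-z0-9\-_]+/\d+)')

def endpoint_from_elements(page):
    # Check each selector in turn; for every matching element,
    # look at its value attribute first, then its visible text.
    for selector in input_selectors:
        for element in page.query_selector_all(selector):
            text = element.get_attribute("value") or element.inner_text()
            match = ENDPOINT_RE.search(text or "")
            if match:
                return match.group(1)
    return None

# Minimal stand-ins so the sketch runs without a browser
class FakeElement:
    def __init__(self, value=None, text=""):
        self._value, self._text = value, text
    def get_attribute(self, name):
        return self._value
    def inner_text(self):
        return self._text

class FakePage:
    def query_selector_all(self, selector):
        if selector.startswith("input"):
            return [FakeElement(value="https://serverless.roboflow.com/basketball-detection/1")]
        return []

print(endpoint_from_elements(FakePage()))  # basketball-detection/1
```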

Strategy 4: Page Text

As a text-based fallback, scan the visible text of the page for endpoint URLs.

Strategy 5: Legacy Endpoints

Support older endpoint formats:

  • detect.roboflow.com
  • classify.roboflow.com
  • segment.roboflow.com

Strategy 6: URL Construction

Build the endpoint from the page URL structure if all else fails.
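A sketch of this last-resort construction, assuming Universe URLs follow the form `https://universe.roboflow.com/<workspace>/<project>[/model/<version>]` (the URL layout and the default version are assumptions for illustration):

```python
from urllib.parse import urlparse

def endpoint_from_url(page_url, default_version="1"):
    # Split the path into non-empty segments:
    # /<workspace>/<project>[/model/<version>]
    parts = [p for p in urlparse(page_url).path.split("/") if p]
    if len(parts) < 2:
        return None
    project = parts[1]
    # Use the explicit version if present, otherwise fall back
    version = parts[3] if len(parts) >= 4 and parts[2] == "model" else default_version
    return f"https://serverless.roboflow.com/{project}/{version}"

print(endpoint_from_url(
    "https://universe.roboflow.com/workspace/basketball-detection/model/2"
))  # https://serverless.roboflow.com/basketball-detection/2
```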

With six strategies and fallbacks between them, the scraper can still locate the API endpoint even when parts of the page structure change.

Tech Stack

  • Playwright – Browser automation (more reliable than requests for dynamic content)
  • Python 3.7+ – Core language
  • Regex – Pattern matching for extraction

Usage

Basic Example

# Search for basketball detection models
SEARCH_KEYWORDS="basketball model object detection" \
MAX_PROJECTS=5 \
python roboflow_search_agent.py

JSON Output

# Get structured JSON output
SEARCH_KEYWORDS="soccer ball instance segmentation" \
OUTPUT_JSON=true \
python roboflow_search_agent.py
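Internally, configuration like this is typically read from the environment with sensible defaults. The variable names come from the examples above; the parsing logic here is an assumed sketch, not the tool's actual code:

```python
import os

def load_config():
    # Read settings from environment variables, falling back to
    # defaults when a variable is unset
    return {
        "keywords": os.environ.get("SEARCH_KEYWORDS", ""),
        "max_projects": int(os.environ.get("MAX_PROJECTS", "5")),
        "output_json": os.environ.get("OUTPUT_JSON", "false").lower() == "true",
    }

os.environ["SEARCH_KEYWORDS"] = "basketball model object detection"
os.environ["OUTPUT_JSON"] = "true"
print(load_config())
```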

Example Output

[
  {
    "project_title": "Basketball Detection",
    "url": "https://universe.roboflow.com/workspace/basketball-detection",
    "author": "John Doe",
    "project_type": "Object Detection",
    "has_model": true,
    "mAP": "85.2%",
    "precision": "87.1%",
    "recall": "83.5%",
    "training_images": "5000",
    "classes": ["basketball", "player"],
    "api_endpoint": "https://serverless.roboflow.com/basketball-detection/1",
    "model_identifier": "workspace/basketball-detection"
  }
]
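Because the output is plain JSON, it is easy to post-process. For instance, one might keep only models above a metric threshold; the sample data and `filter_by_map` helper below are illustrative assumptions:

```python
import json

# Abbreviated sample resembling the tool's JSON output
results_json = '''[
  {"project_title": "Basketball Detection", "mAP": "85.2%", "has_model": true},
  {"project_title": "Hoop Finder", "mAP": "72.4%", "has_model": true}
]'''

def filter_by_map(results, threshold):
    # mAP arrives as a string like "85.2%", so strip the sign before comparing
    return [r for r in results if float(r["mAP"].rstrip("%")) >= threshold]

strong = filter_by_map(json.loads(results_json), 80.0)
print([r["project_title"] for r in strong])  # ['Basketball Detection']
```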

Key Features

  1. Intelligent Search – The tool applies the “Has a Model” filter automatically and handles keyword prioritization.
  2. Comprehensive Data Extraction – Extracts performance metrics, training data info, project metadata, and the hard‑to‑get API endpoints.
  3. Robust Error Handling – Automatic retries (3 attempts), graceful failure handling, and timeout management.
  4. Flexible Output – Human‑readable console output, JSON format for programmatic use, configurable via environment variables.

Technical Highlights

Browser Automation with Playwright

from playwright.sync_api import sync_playwright

def connect_browser(headless=True):
    # Launch Chromium; the sandbox flags allow running in containers/CI
    playwright = sync_playwright().start()
    browser = playwright.chromium.launch(
        headless=headless,
        args=["--no-sandbox", "--disable-setuid-sandbox"]
    )
    context = browser.new_context(viewport={"width": 1440, "height": 900})
    page = context.new_page()
    return playwright, browser, context, page

Smart Scrolling

Instead of fixed waits, the scraper detects when content stops loading:

def scroll_page(page, max_scrolls=15):
    last_height = 0
    for _ in range(max_scrolls):
        page.evaluate("window.scrollBy(0, window.innerHeight)")
        page.wait_for_timeout(800)
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

Lessons Learned

  • Multiple Strategies > Single Strategy – Having fallbacks makes the scraper much more reliable.
  • Playwright > Requests – For dynamic sites, browser automation is essential.
  • Pattern Matching – Regex patterns need careful testing with real data.
  • Error Handling – Web scraping is fragile; always include retry logic.

Use Cases

  • Research – Quickly find models for specific tasks.
  • API Discovery – Retrieve ready‑to‑use endpoints for integration.
  • Automation – Feed model metadata into pipelines or dashboards.

Happy scraping!

📦 Installation

# Clone the repository
git clone https://github.com/SumitS10/Roboflow-.git
cd Roboflow-

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers
playwright install chromium

🔧 Features

  • Model Comparison – Compare metrics across multiple models.
  • Automation – Seamlessly integrate into ML pipelines.

🚀 Future Improvements

  • Add filtering by metrics (e.g., mAP > 80%)
  • Support for batch processing multiple searches
  • Export results to CSV/Excel
  • Add advanced model‑comparison features
  • Cache results to avoid re‑scraping

🏁 Conclusion

Building this scraper taught me a lot about web scraping, browser automation, and handling edge cases. The multi‑strategy approach for API extraction was key to making it reliable.

If you’re working with Roboflow models or need to automate model discovery, give it a try! Contributions and feedback are welcome.

