Building a Roboflow Universe Search Agent: Automating ML Model Discovery

Published: January 5, 2026 at 11:43 AM EST
4 min read
Source: Dev.to

Problem

As a machine learning enthusiast, I often find myself browsing Roboflow Universe looking for pre‑trained models.
Manually searching, clicking through pages, and copying API endpoints is tedious. I wanted a way to:

  • Search for models by keywords
  • Extract detailed information (metrics, classes, API endpoints)
  • Get structured data I could use programmatically

So I built a Python web scraper that does exactly that! 🚀

What It Does

The Roboflow Universe Search Agent is a Python tool that:

  • ✅ Searches Roboflow Universe with custom keywords
  • ✅ Extracts model details (title, author, metrics, classes)
  • ✅ Finds API endpoints using multiple extraction strategies
  • ✅ Outputs structured JSON data
  • ✅ Handles retries and errors gracefully

The Challenge: Finding API Endpoints

The trickiest part was reliably extracting API endpoints. Roboflow displays them in various places:

  • JavaScript code snippets
  • Model ID variables
  • Input fields
  • Page text
  • Legacy endpoint formats

I needed a robust solution that wouldn’t break if the website structure changed.

The Solution: Multi‑Strategy Extraction

Instead of relying on a single method, I implemented six different extraction strategies with fallbacks.

Strategy 1: JavaScript Code Blocks

The most reliable source – API endpoints appear in code snippets:

js_patterns = [
    r'url:\s*["\']https://serverless\.roboflow\.com/([^"\'?\s]+)["\']',
    r'"https://serverless\.roboflow\.com/([^"\'?\s]+)"',
    r'https://serverless\.roboflow\.com/([a-z0-9\-_]+/\d+)',
]
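To illustrate how these patterns behave in practice, here is a minimal sketch that tries each one in order against a snippet resembling the JavaScript Roboflow renders in its code examples. The sample HTML and the `extract_model_id` helper are assumptions for illustration, not the tool's actual code.

```python
import re

js_patterns = [
    r'url:\s*["\']https://serverless\.roboflow\.com/([^"\'?\s]+)["\']',
    r'"https://serverless\.roboflow\.com/([^"\'?\s]+)"',
    r'https://serverless\.roboflow\.com/([a-z0-9\-_]+/\d+)',
]

# Hypothetical snippet resembling what a model page embeds
html = '''
fetch({
    url: "https://serverless.roboflow.com/basketball-detection/1",
    method: "POST",
})
'''

def extract_model_id(text):
    # Try each pattern in order; the first match wins
    for pattern in js_patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None

print(extract_model_id(html))  # basketball-detection/1
```

Ordering the patterns from most to least specific means the strictest match is preferred when several would fire.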

Strategy 2: Model ID Patterns

Extract from JavaScript variables:

model_id_patterns = [
    r'model_id["\']?\s*[:=]\s*["\']([a-z0-9\-_]+/\d+)["\']',
    r'MODEL_ENDPOINT["\']?\s*[:=]\s*["\']([a-z0-9\-_]+/\d+)["\']',
]
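These patterns target assignments in inline scripts. A quick sketch against a hypothetical variable declaration (the `script` string is an assumed example, not taken from Roboflow's actual markup):

```python
import re

model_id_patterns = [
    r'model_id["\']?\s*[:=]\s*["\']([a-z0-9\-_]+/\d+)["\']',
    r'MODEL_ENDPOINT["\']?\s*[:=]\s*["\']([a-z0-9\-_]+/\d+)["\']',
]

# Hypothetical inline script found in the page source
script = 'var model_id = "basketball-detection/1";'

match = None
for pattern in model_id_patterns:
    match = re.search(pattern, script)
    if match:
        break

print(match.group(1))  # basketball-detection/1
```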

Strategy 3: Input Fields & Textareas

Check form elements and code blocks:

input_selectors = [
    "input[value*='serverless.roboflow.com']",
    "textarea",
    "code",
]
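Walking these selectors might look like the sketch below. It assumes a Playwright-style `page` object exposing `query_selector_all()`; the `FakePage` stand-in exists only so the example runs without a browser, and the helper names are illustrative.

```python
import re

input_selectors = [
    "input[value*='serverless.roboflow.com']",
    "textarea",
    "code",
]

ENDPOINT_RE = re.compile(r'serverless\.roboflow\.com/([a-z0-9\-_]+/\d+)')

def endpoint_from_elements(page):
    # Check each selector in turn; for every matching element,
    # look at its value attribute first, then its visible text.
    for selector in input_selectors:
        for element in page.query_selector_all(selector):
            text = element.get_attribute("value") or element.inner_text()
            match = ENDPOINT_RE.search(text or "")
            if match:
                return match.group(1)
    return None

# Minimal stand-ins so the sketch runs without a browser
class FakeElement:
    def __init__(self, value=None, text=""):
        self._value, self._text = value, text
    def get_attribute(self, name):
        return self._value
    def inner_text(self):
        return self._text

class FakePage:
    def query_selector_all(self, selector):
        if selector.startswith("input"):
            return [FakeElement(value="https://serverless.roboflow.com/basketball-detection/1")]
        return []

print(endpoint_from_elements(FakePage()))  # basketball-detection/1
```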

Strategy 4: Page Text

As a text-based fallback, scan the visible text of the page for endpoint URLs.

Strategy 5: Legacy Endpoints

Support older endpoint formats:

  • detect.roboflow.com
  • classify.roboflow.com
  • segment.roboflow.com

Strategy 6: URL Construction

Build the endpoint from the page URL structure if all else fails.
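A sketch of this last-resort construction, assuming Universe URLs follow the form `https://universe.roboflow.com/<workspace>/<project>[/model/<version>]` (the URL layout and the default version are assumptions for illustration):

```python
from urllib.parse import urlparse

def endpoint_from_url(page_url, default_version="1"):
    # Split the path into non-empty segments:
    # /<workspace>/<project>[/model/<version>]
    parts = [p for p in urlparse(page_url).path.split("/") if p]
    if len(parts) < 2:
        return None
    project = parts[1]
    # Use the explicit version if present, otherwise fall back
    version = parts[3] if len(parts) >= 4 and parts[2] == "model" else default_version
    return f"https://serverless.roboflow.com/{project}/{version}"

print(endpoint_from_url(
    "https://universe.roboflow.com/workspace/basketball-detection/model/2"
))  # https://serverless.roboflow.com/basketball-detection/2
```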

With six strategies and fallbacks between them, the scraper can still locate the API endpoint even when parts of the page structure change.

Tech Stack

  • Playwright – Browser automation (more reliable than requests for dynamic content)
  • Python 3.7+ – Core language
  • Regex – Pattern matching for extraction

Usage

Basic Example

# Search for basketball detection models
SEARCH_KEYWORDS="basketball model object detection" \
MAX_PROJECTS=5 \
python roboflow_search_agent.py

JSON Output

# Get structured JSON output
SEARCH_KEYWORDS="soccer ball instance segmentation" \
OUTPUT_JSON=true \
python roboflow_search_agent.py
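Internally, configuration like this is typically read from the environment with sensible defaults. The variable names come from the examples above; the parsing logic here is an assumed sketch, not the tool's actual code:

```python
import os

def load_config():
    # Read settings from environment variables, falling back to
    # defaults when a variable is unset
    return {
        "keywords": os.environ.get("SEARCH_KEYWORDS", ""),
        "max_projects": int(os.environ.get("MAX_PROJECTS", "5")),
        "output_json": os.environ.get("OUTPUT_JSON", "false").lower() == "true",
    }

os.environ["SEARCH_KEYWORDS"] = "basketball model object detection"
os.environ["OUTPUT_JSON"] = "true"
print(load_config())
```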

Example Output

[
  {
    "project_title": "Basketball Detection",
    "url": "https://universe.roboflow.com/workspace/basketball-detection",
    "author": "John Doe",
    "project_type": "Object Detection",
    "has_model": true,
    "mAP": "85.2%",
    "precision": "87.1%",
    "recall": "83.5%",
    "training_images": "5000",
    "classes": ["basketball", "player"],
    "api_endpoint": "https://serverless.roboflow.com/basketball-detection/1",
    "model_identifier": "workspace/basketball-detection"
  }
]
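Because the output is plain JSON, it is easy to post-process. For instance, one might keep only models above a metric threshold; the sample data and `filter_by_map` helper below are illustrative assumptions:

```python
import json

# Abbreviated sample resembling the tool's JSON output
results_json = '''[
  {"project_title": "Basketball Detection", "mAP": "85.2%", "has_model": true},
  {"project_title": "Hoop Finder", "mAP": "72.4%", "has_model": true}
]'''

def filter_by_map(results, threshold):
    # mAP arrives as a string like "85.2%", so strip the sign before comparing
    return [r for r in results if float(r["mAP"].rstrip("%")) >= threshold]

strong = filter_by_map(json.loads(results_json), 80.0)
print([r["project_title"] for r in strong])  # ['Basketball Detection']
```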

Key Features

  1. Intelligent Search – The tool applies the “Has a Model” filter automatically and handles keyword prioritization.
  2. Comprehensive Data Extraction – Extracts performance metrics, training data info, project metadata, and the hard‑to‑get API endpoints.
  3. Robust Error Handling – Automatic retries (3 attempts), graceful failure handling, and timeout management.
  4. Flexible Output – Human‑readable console output, JSON format for programmatic use, configurable via environment variables.

Technical Highlights

Browser Automation with Playwright

from playwright.sync_api import sync_playwright

def connect_browser(headless=True):
    # Launch Chromium; the sandbox flags allow running in containers/CI
    playwright = sync_playwright().start()
    browser = playwright.chromium.launch(
        headless=headless,
        args=["--no-sandbox", "--disable-setuid-sandbox"]
    )
    context = browser.new_context(viewport={"width": 1440, "height": 900})
    page = context.new_page()
    return playwright, browser, context, page

Smart Scrolling

Instead of fixed waits, the scraper detects when content stops loading:

def scroll_page(page, max_scrolls=15):
    last_height = 0
    for _ in range(max_scrolls):
        page.evaluate("window.scrollBy(0, window.innerHeight)")
        page.wait_for_timeout(800)
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

Lessons Learned

  • Multiple Strategies > Single Strategy – Having fallbacks makes the scraper much more reliable.
  • Playwright > Requests – For dynamic sites, browser automation is essential.
  • Pattern Matching – Regex patterns need careful testing with real data.
  • Error Handling – Web scraping is fragile; always include retry logic.

Use Cases

  • Research – Quickly find models for specific tasks.
  • API Discovery – Retrieve ready‑to‑use endpoints for integration.
  • Automation – Feed model metadata into pipelines or dashboards.

Happy scraping!

📦 Installation

# Clone the repository
git clone https://github.com/SumitS10/Roboflow-.git
cd Roboflow-

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers
playwright install chromium

🔧 Features

  • Model Comparison – Compare metrics across multiple models.
  • Automation – Seamlessly integrate into ML pipelines.

🚀 Future Improvements

  • Add filtering by metrics (e.g., mAP > 80%)
  • Support for batch processing multiple searches
  • Export results to CSV/Excel
  • Add advanced model‑comparison features
  • Cache results to avoid re‑scraping

🏁 Conclusion

Building this scraper taught me a lot about web scraping, browser automation, and handling edge cases. The multi‑strategy approach for API extraction was key to making it reliable.

If you’re working with Roboflow models or need to automate model discovery, give it a try! Contributions and feedback are welcome.

