Build Better RAG Pipelines: Scraping Technical Docs to Clean Markdown

Published: December 11, 2025 at 10:56 PM EST
2 min read
Source: Dev.to

The Problem with Generic Scraping

If you simply curl a documentation page or use a generic crawler, your LLM context gets flooded with noise:

  • Navigation menus repeated on every single page (e.g., “Home > Docs > API…”).
  • Sidebars that confuse semantic search.
  • Footers, cookie banners, and scripts.
  • Broken code blocks that lose their language tags.

Your retrieval system may end up matching the “Terms of Service” link in the footer instead of the actual API method you were looking for.
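To see the problem concretely, here is a minimal sketch of what generic scraping produces, using requests and BeautifulSoup as stand-ins for a naive crawler (the URL is illustrative):

import requests
from bs4 import BeautifulSoup

# Naive approach: fetch the page and dump all visible text.
html = requests.get("https://example-docs.dev/api/reference").text
text = BeautifulSoup(html, "html.parser").get_text(separator="\n")

# The dump mixes real content with nav menus, sidebars, cookie banners,
# and footer links -- all of which end up polluting your embeddings.
print(text[:500])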

The Solution: A Framework‑Aware Scraper

I built Tech Docs to LLM‑Ready Markdown to solve this exact problem.
Instead of treating every page as a bag of HTML tags, this Apify actor detects the documentation framework (Docusaurus, GitBook, MkDocs, etc.) and intelligently extracts only the content you care about.

Tech Docs to Markdown for RAG & LLM – Apify

🚀 Key Features for RAG Pipelines

1. Smart Framework Detection

Automatically identifies the underlying tech stack and applies specialized extraction rules:

  • ✅ Docusaurus
  • ✅ GitBook
  • ✅ MkDocs (Material)
  • ✅ ReadTheDocs
  • ✅ VuePress / Nextra
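Under the hood, detection like this can hinge on meta generator tags and framework-specific DOM markers. The actor's exact rules aren't published, so the following Python sketch is purely illustrative:

from bs4 import BeautifulSoup

# Illustrative heuristics -- not the actor's actual detection rules.
MARKERS = {
    "docusaurus": ["meta[name=generator][content*=Docusaurus]", ".theme-doc-markdown"],
    "mkdocs": ["meta[name=generator][content*=mkdocs]", ".md-content"],
    "gitbook": ["meta[name=generator][content*=GitBook]"],
    "readthedocs": [".rst-content"],
}

def detect_framework(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for framework, selectors in MARKERS.items():
        if any(soup.select_one(sel) for sel in selectors):
            return framework
    return "generic"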

2. Auto‑Cleaning

Strips out:

  • Sidebars & top navigation
  • “Edit this page” links
  • Table of contents (redundant for embeddings)
  • Footers & legal text

3. RAG‑First Output Format 🤖

The scraper outputs structured data designed for vector databases:

  • doc_id – stable, unique hash of the URL (great for deduplication)
  • section_path – breadcrumb path (e.g., Guides > Advanced > Configuration)
  • chunk_index – built‑in chunking support (see the reassembly sketch below)
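Because every chunk carries doc_id and chunk_index, stitching a full page back together is straightforward. A minimal sketch, assuming items is the list of dataset records:

from collections import defaultdict

def reassemble(items: list[dict]) -> dict[str, str]:
    # Group chunks by doc_id, then stitch them back in order.
    pages = defaultdict(list)
    for item in items:
        pages[item["doc_id"]].append(item)
    return {
        doc_id: "\n\n".join(
            c["content"] for c in sorted(chunks, key=lambda c: c["chunk_index"])
        )
        for doc_id, chunks in pages.items()
    }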

Example Output

{
    "doc_id": "acdb145c14f4310b",
    "title": "Introduction | Crawlee",
    "section_path": "Guides > Quick Start > Introduction",
    "content": "# Introduction\n\nCrawlee covers your crawling...",
    "framework": "docusaurus",
    "metadata": {
        "wordCount": 358,
        "crawledAt": "2025-12-12T03:34:46.151Z"
    }
}
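The doc_id above is a 16‑character hex string, consistent with a truncated URL hash. If you want to reproduce deduplication keys yourself, something like this works (truncated SHA‑256 is my assumption; the actor's exact hash function isn't documented):

import hashlib

def make_doc_id(url: str) -> str:
    # Assumption: a truncated SHA-256 of the URL. Any stable hash works
    # for deduplication as long as you apply it consistently.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]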

🛠️ Integration with LangChain

Because the output is already structured, loading it into LangChain is trivial using the ApifyDatasetLoader.

# Newer LangChain releases ship these in langchain_community / langchain_core;
# on older releases the paths are langchain.document_loaders and langchain.docstore.document.
from langchain_community.document_loaders import ApifyDatasetLoader
from langchain_core.documents import Document

loader = ApifyDatasetLoader(
    dataset_id="YOUR_DATASET_ID",
    dataset_mapping_function=lambda item: Document(
        page_content=item["content"],
        metadata={
            "source": item["url"],
            "title": item["title"],
            "doc_id": item["doc_id"],
            "section": item["section_path"]  # <--- Filter by section later!
        }
    ),
)

docs = loader.load()
print(f"Loaded {len(docs)} clean documents.")

📉 Cost & Performance

The actor uses a custom lightweight extraction engine (on top of Cheerio), making it fast and cheap:

  • Pricing: Pay‑per‑result ($0.50 per 1,000 pages)
  • Speed: Can process hundreds of pages per minute

Try It Out

If you are building an AI assistant for a library, SDK, or internal docs, give it a shot. It saves hours of data‑cleaning time.

Try Tech Docs Scraper

Let me know in the comments if there are other documentation frameworks you’d like me to add! 👇
