Build Better RAG Pipelines: Scraping Technical Docs to Clean Markdown
Source: Dev.to
The Problem with Generic Scraping
If you simply curl a documentation page or use a generic crawler, your LLM context gets flooded with noise:
- Navigation menus repeated on every single page (e.g., “Home > Docs > API…”).
- Sidebars that confuse semantic search.
- Footers, cookie banners, and scripts.
- Broken code blocks that lose their language tags.
Your retrieval system may end up matching the “Terms of Service” link in the footer instead of the actual API method you were looking for.
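To see why, here is what the naive approach looks like in practice, a minimal sketch assuming requests and BeautifulSoup (the URL is hypothetical):

import requests
from bs4 import BeautifulSoup

# Fetch a docs page and flatten it to text — the "generic scraping" approach.
html = requests.get("https://example.com/docs/api/create-user").text
soup = BeautifulSoup(html, "html.parser")

# get_text() keeps everything: breadcrumbs, sidebar links, cookie banner,
# footer legal text... all of it lands in the same string you embed.
text = soup.get_text(separator="\n")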
The Solution: A Framework‑Aware Scraper
I built Tech Docs to LLM‑Ready Markdown to solve this exact problem.
Instead of treating every page as a bag of HTML tags, this Apify actor detects the documentation framework (Docusaurus, GitBook, MkDocs, etc.) and intelligently extracts only the content you care about.

🚀 Key Features for RAG Pipelines
1. Smart Framework Detection
Automatically identifies the underlying tech stack and applies specialized extraction rules (a sketch of the detection idea follows this list):
- ✅ Docusaurus
- ✅ GitBook
- ✅ MkDocs (Material)
- ✅ ReadTheDocs
- ✅ VuePress / Nextra
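The actor's exact detection rules aren't published, but the idea can be sketched using real fingerprints these frameworks leave behind: Docusaurus and MkDocs both emit a generator meta tag, and each theme has telltale CSS classes. A minimal, illustrative version (assuming BeautifulSoup; the fallback selectors are examples, not the actor's actual rules):

def detect_framework(soup):
    # Docusaurus and MkDocs both ship a <meta name="generator"> tag.
    gen = soup.find("meta", attrs={"name": "generator"})
    if gen and "docusaurus" in gen.get("content", "").lower():
        return "docusaurus"
    if gen and "mkdocs" in gen.get("content", "").lower():
        return "mkdocs"
    # Fall back to theme-specific DOM fingerprints.
    if soup.select_one(".theme-doc-markdown"):  # Docusaurus content wrapper
        return "docusaurus"
    if soup.select_one(".md-content"):          # MkDocs Material content area
        return "mkdocs"
    return "generic"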
2. Auto‑Cleaning
Strips out (see the sketch after this list):
- Sidebars & top navigation
- “Edit this page” links
- Table of contents (redundant for embeddings)
- Footers & legal text
3. RAG‑First Output Format 🤖
The scraper outputs structured data designed for vector databases:
- doc_id – stable, unique hash of the URL (great for deduplication)
- section_path – breadcrumb path (e.g., Guides > Advanced > Configuration)
- chunk_index – built‑in chunking support
Example Output
{
  "doc_id": "acdb145c14f4310b",
  "title": "Introduction | Crawlee",
  "section_path": "Guides > Quick Start > Introduction",
  "content": "# Introduction\n\nCrawlee covers your crawling...",
  "framework": "docusaurus",
  "metadata": {
    "wordCount": 358,
    "crawledAt": "2025-12-12T03:34:46.151Z"
  }
}
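Because doc_id is stable across runs, re-crawls are easy to deduplicate, and chunk_index lets you stitch a page back together when you need full context. A small sketch, assuming items is a list of records shaped like the one above:

# Deduplicate re-crawled records by (doc_id, chunk_index).
seen, fresh = set(), []
for item in items:
    key = (item["doc_id"], item.get("chunk_index", 0))
    if key not in seen:
        seen.add(key)
        fresh.append(item)

# Reassemble one document from its chunks, in order.
chunks = sorted(
    (i for i in fresh if i["doc_id"] == "acdb145c14f4310b"),
    key=lambda i: i.get("chunk_index", 0),
)
full_page = "\n\n".join(c["content"] for c in chunks)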
🛠️ Integration with LangChain
Because the output is already structured, loading it into LangChain is trivial using the ApifyDatasetLoader.
from langchain.document_loaders import ApifyDatasetLoader
from langchain.docstore.document import Document

loader = ApifyDatasetLoader(
    dataset_id="YOUR_DATASET_ID",
    dataset_mapping_function=lambda item: Document(
        page_content=item["content"],
        metadata={
            "source": item["url"],
            "title": item["title"],
            "doc_id": item["doc_id"],
            "section": item["section_path"],  # <--- Filter by section later!
        },
    ),
)

docs = loader.load()
print(f"Loaded {len(docs)} clean documents.")
📉 Cost & Performance
The actor uses a custom lightweight extraction engine (on top of Cheerio), making it fast and cheap:
- Pricing: Pay‑per‑result ($0.50 per 1,000 pages)
- Speed: Can process hundreds of pages per minute
Try It Out
If you are building an AI assistant for a library, SDK, or internal docs, give it a shot. It saves hours of data‑cleaning time.
Let me know in the comments if there are other documentation frameworks you’d like me to add! 👇