Extracting Clean Markdown from Any URL: The PageBolt /extract Endpoint
Source: Dev.to
The Problem with Raw HTML
When an AI agent reads a web page, feeding the raw HTML to an LLM forces it to wade through scripts, ads, navigation menus, footers, and other boilerplate.
Typical breakdown:
- Scripts & stylesheets – ignored by the model
- Navigation menus – ignored
- Ads & tracking pixels – ignored
- ≈10 KB of boilerplate – wasted tokens
- ≈2 KB of actual content – what you need
If the agent processes 50 KB of HTML to extract 2 KB of useful text, about 96 % of the tokens are wasted, increasing cost and latency.
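The arithmetic behind that figure is easy to check. The sketch below uses a hypothetical helper, `wastedFraction`, and treats page size as a rough proxy for token count, which is an assumption: real tokenizers do not map bytes to tokens one-to-one.

```javascript
// Hypothetical helper: share of a page's payload that is not useful content.
// Sizes can be in any consistent unit (KB here); size is only a rough proxy for tokens.
function wastedFraction(totalSize, usefulSize) {
  if (totalSize <= 0 || usefulSize < 0 || usefulSize > totalSize) {
    throw new RangeError('expected totalSize > 0 and 0 <= usefulSize <= totalSize');
  }
  return (totalSize - usefulSize) / totalSize;
}

const wasted = wastedFraction(50, 2);
console.log(`${(wasted * 100).toFixed(0)}% wasted`); // 96% wasted
```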
Introducing PageBolt’s /extract Endpoint
PageBolt provides a single‑purpose API that:
- Takes a URL.
- Extracts the main article/content.
- Returns the result as clean Markdown (plus optional metadata).
The response is ready to feed directly into any LLM.
Simple JavaScript Example
```javascript
// Fetch clean Markdown from a URL
const response = await fetch('https://api.pagebolt.com/v1/extract', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer YOUR_API_KEY`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://example.com/blog/article'
  })
});

const data = await response.json();
console.log(data.markdown);
// => "# Article Title\n\nArticle content in clean Markdown..."
```
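The example above assumes the request always succeeds. A slightly more defensive variant, sketched below, checks the HTTP status and the presence of the `markdown` field before returning. The helper name `extractMarkdown` and the injectable `fetchImpl` parameter are illustrative and not part of PageBolt's API. The error handling is also an assumption, since the article does not document error responses.

```javascript
// Hypothetical wrapper around the /extract call shown above.
// fetchImpl is injectable so the function can be exercised without network access.
async function extractMarkdown(url, apiKey, fetchImpl = fetch) {
  const response = await fetchImpl('https://api.pagebolt.com/v1/extract', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ url })
  });

  // fetch only rejects on network failure, so HTTP errors must be checked explicitly.
  if (!response.ok) {
    throw new Error(`Extraction failed with HTTP ${response.status}`);
  }

  const data = await response.json();
  if (typeof data.markdown !== 'string') {
    throw new Error('Response did not contain a markdown field');
  }
  return data;
}
```

A caller would then write something like `const { markdown } = await extractMarkdown(pageUrl, process.env.PAGEBOLT_API_KEY);` and handle any thrown error in one place.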
Using /extract with an LLM (Anthropic Claude)
```javascript
const Anthropic = require('@anthropic-ai/sdk');

// The client reads ANTHROPIC_API_KEY from the environment by default
const client = new Anthropic();

async function summarizeResearchPaper(paperUrl) {
  // 1️⃣ Extract the paper as Markdown
  const extractResponse = await fetch('https://api.pagebolt.com/v1/extract', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.PAGEBOLT_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      url: paperUrl,
      format: 'markdown'
    })
  });
  const { markdown, title, author } = await extractResponse.json();

  // 2️⃣ Pass clean Markdown to Claude
  const message = await client.messages.create({
    model: 'claude-opus-4-5',
    max_tokens: 1024,
    messages: [
      {
        role: 'user',
        content: `Summarize this research paper in 3 bullet points:
Title: ${title}
Author: ${author}
Content:
${markdown}`
      }
    ]
  });

  return message.content[0].text;
}

// Example usage (a promise chain, because top-level await is not available in CommonJS)
summarizeResearchPaper('https://arxiv.org/pdf/2406.12345')
  .then((summary) => console.log(summary))
  .catch((err) => console.error(err));
```
Token Efficiency Gains
| Scenario | Input tokens | Output tokens | Approx. cost* |
|---|---|---|---|
| Raw HTML (no extraction) | 50,000 | 500 | ~$1.50 |
| Clean Markdown (with /extract) | 2,000 | 500 | ~$0.06 |
*Cost based on typical LLM pricing; actual rates may vary.
In this example, clean Markdown cuts input tokens by about 25× (from 50,000 to 2,000), which lowers cost and latency. It can also help the model follow the document, since Markdown headings and lists convey structure without the surrounding markup noise.
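To run the same comparison for your own workload, a small estimator is enough. The sketch below uses a hypothetical `estimateInputCost` helper and an assumed rate of $30 per million input tokens. That rate is inferred from the table (it reproduces both cost figures when output tokens are ignored) and is not a price quoted by any provider.

```javascript
// Assumed rate, chosen because it reproduces the table above; substitute your model's real price.
const ASSUMED_USD_PER_MILLION_INPUT_TOKENS = 30;

// Hypothetical helper: cost of the input side of a single request, in USD.
function estimateInputCost(inputTokens, usdPerMillion = ASSUMED_USD_PER_MILLION_INPUT_TOKENS) {
  return (inputTokens * usdPerMillion) / 1_000_000;
}

const rawHtmlCost = estimateInputCost(50_000);
const markdownCost = estimateInputCost(2_000);

console.log(rawHtmlCost.toFixed(2));  // 1.50
console.log(markdownCost.toFixed(2)); // 0.06
console.log(50_000 / 2_000);          // 25
```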
Common Use Cases
- Research aggregator – extract and summarize dozens of papers.
- Competitive intelligence – pull competitor web pages for analysis.
- Documentation agent – ingest API docs and answer developer questions.
- News digest – collect daily articles and generate briefings.
- Content curator – fetch blog posts and categorize them.
- Customer support – pull help‑center articles to train a support bot.
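Several of these use cases involve fetching many pages in one run. One way to do that without flooding the API is to process URLs in small batches, as sketched below. The `extractInBatches` name, the default batch size, and the `extractOne` callback (any function that takes a URL and resolves to an extraction result) are all illustrative. The article does not state PageBolt's rate limits, so treat the batch size as a placeholder.

```javascript
// Hypothetical helper: run extractOne over urls, batchSize at a time.
// Failures are recorded per URL instead of aborting the whole run.
async function extractInBatches(urls, extractOne, batchSize = 5) {
  const results = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls.slice(i, i + batchSize);
    // The async arrow turns synchronous throws into rejections as well.
    const settled = await Promise.allSettled(batch.map(async (url) => extractOne(url)));
    settled.forEach((outcome, j) => {
      results.push(
        outcome.status === 'fulfilled'
          ? { url: batch[j], ok: true, data: outcome.value }
          : { url: batch[j], ok: false, error: String(outcome.reason) }
      );
    });
  }
  return results;
}
```

Results come back in the same order as the input URLs, so a failed page can be retried or skipped without losing track of the others.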
Sample API Response
{
"markdown": "# Article Title\n\nArticle content...",
"title": "Article Title",
"author": "Author Name",
"published_date": "2026-03-18",
"word_count": 1200,
"estimated_reading_time_minutes": 5
}
The JSON includes the Markdown body plus useful metadata that can be displayed or stored.
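Some pages may lack an author or a publication date. This is an assumption, as the article does not say which fields are optional, but it makes normalizing the metadata before display worthwhile. The sketch below builds a short header from the fields in the sample response; `formatExtractHeader` is a hypothetical helper.

```javascript
// Hypothetical helper: turn the metadata fields from the sample response into a display header.
// Missing fields are skipped rather than printed as "undefined".
function formatExtractHeader(data) {
  const parts = [];
  if (data.title) parts.push(`Title: ${data.title}`);
  if (data.author) parts.push(`Author: ${data.author}`);
  if (data.published_date) parts.push(`Published: ${data.published_date}`);
  if (Number.isFinite(data.word_count)) parts.push(`Words: ${data.word_count}`);
  return parts.join('\n');
}

// Same shape as the sample response above
const sample = {
  markdown: '# Article Title\n\nArticle content...',
  title: 'Article Title',
  author: 'Author Name',
  published_date: '2026-03-18',
  word_count: 1200,
  estimated_reading_time_minutes: 5
};

console.log(formatExtractHeader(sample));
```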
Pricing
| Plan | Monthly price | Included extractions |
|---|---|---|
| Starter | $29 | 500 |
| Growth | $79 | 5,000 |
| Scale | $199 | 50,000 |
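To compare plans for a given workload, it helps to look at the effective price per extraction. The sketch below copies the three plans from the table and picks the cheapest one whose quota covers an expected monthly volume. `cheapestPlanFor` is a hypothetical helper, and it ignores overage pricing, which the article does not describe.

```javascript
// Plans copied from the pricing table above.
const PLANS = [
  { name: 'Starter', monthlyUsd: 29, extractions: 500 },
  { name: 'Growth', monthlyUsd: 79, extractions: 5000 },
  { name: 'Scale', monthlyUsd: 199, extractions: 50000 }
];

// Hypothetical helper: cheapest plan whose quota covers the expected monthly volume.
// Returns null when no listed plan is large enough.
function cheapestPlanFor(monthlyExtractions) {
  const candidates = PLANS.filter((p) => p.extractions >= monthlyExtractions);
  if (candidates.length === 0) return null;
  return candidates.reduce((best, p) => (p.monthlyUsd < best.monthlyUsd ? p : best));
}

// Effective price per extraction when the full quota is used
for (const plan of PLANS) {
  const perExtraction = plan.monthlyUsd / plan.extractions;
  console.log(`${plan.name}: $${perExtraction.toFixed(4)} per extraction`);
}
```

At full quota usage this works out to roughly 5.8 cents, 1.6 cents, and 0.4 cents per extraction for Starter, Growth, and Scale respectively.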
Getting Started
- Obtain an API key from pagebolt.dev/pricing.
- Make a POST request to /extract with the target URL (and optional parameters).
- Consume the returned Markdown in your AI agent; no HTML parsing required.
Start for free with 100 extractions per month (no credit card needed).
Your AI agent now has a direct pipeline from URLs to clean, LLM‑friendly content. No noise, just the data it needs.