Extracting Clean Markdown from Any URL: The PageBolt /extract Endpoint

Published: March 19, 2026 at 02:02 AM EDT
3 min read
Source: Dev.to

The Problem with Raw HTML

When an AI agent reads a web page, feeding the raw HTML to an LLM forces the model to wade through scripts, ads, navigation menus, footers, and other boilerplate.
Typical breakdown:

  • Scripts & stylesheets – ignored by the model
  • Navigation menus – ignored
  • Ads & tracking pixels – ignored
  • ≈10 KB of boilerplate – wasted tokens
  • ≈2 KB of actual content – what you need

If the agent processes 50 KB of HTML to extract 2 KB of useful text, about 96% of the tokens are wasted, increasing cost and latency.
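For concreteness, the waste in that scenario can be computed directly. This sketch assumes the common rule of thumb of roughly 4 characters per token; the 50 KB / 2 KB figures are the illustrative numbers from above:

```javascript
// Rough token-waste estimate for the raw-HTML scenario.
// Assumes ~4 characters per token (a common rule of thumb, not exact).
const CHARS_PER_TOKEN = 4;

function wastedShare(totalBytes, usefulBytes) {
  const totalTokens = totalBytes / CHARS_PER_TOKEN;
  const usefulTokens = usefulBytes / CHARS_PER_TOKEN;
  return (totalTokens - usefulTokens) / totalTokens;
}

console.log(wastedShare(50_000, 2_000)); // 0.96 → 96% of tokens wasted
```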

Introducing PageBolt’s /extract Endpoint

PageBolt provides a single‑purpose API that:

  1. Takes a URL.
  2. Extracts the main article/content.
  3. Returns the result as clean Markdown (plus optional metadata).

The response is ready to feed directly into any LLM.

Simple JavaScript Example

// fetch clean Markdown from a URL
const response = await fetch('https://api.pagebolt.com/v1/extract', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer YOUR_API_KEY`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://example.com/blog/article'
  })
});

const data = await response.json();
console.log(data.markdown);
// => "# Article Title\n\nArticle content in clean Markdown..."
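The snippet above assumes the request succeeds. In practice it is worth checking the HTTP status before parsing the body. A guarded variant (the endpoint and request shape are taken from the example above; the optional `fetchImpl` parameter is just dependency injection for testing, not part of the API):

```javascript
// Guarded variant of the call above: check response.ok before parsing.
async function extractMarkdown(url, apiKey, fetchImpl = fetch) {
  const response = await fetchImpl('https://api.pagebolt.com/v1/extract', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ url })
  });
  if (!response.ok) {
    throw new Error(`extract failed: HTTP ${response.status}`);
  }
  return (await response.json()).markdown;
}
```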

Using /extract with an LLM (Anthropic Claude)

const Anthropic = require('@anthropic-ai/sdk');
const client = new Anthropic();

async function summarizeResearchPaper(paperUrl) {
  // 1️⃣ Extract the paper as Markdown
  const extractResponse = await fetch('https://api.pagebolt.com/v1/extract', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.PAGEBOLT_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      url: paperUrl,
      format: 'markdown'
    })
  });

  const { markdown, title, author } = await extractResponse.json();

  // 2️⃣ Pass clean Markdown to Claude
  const message = await client.messages.create({
    model: 'claude-opus-4-5',
    max_tokens: 1024,
    messages: [
      {
        role: 'user',
        content: `Summarize this research paper in 3 bullet points:

Title: ${title}
Author: ${author}

Content:
${markdown}`
      }
    ]
  });

  return message.content[0].text;
}

// Example usage (the file uses require(), i.e. CommonJS, where
// top-level await is unavailable, so chain .then instead)
summarizeResearchPaper('https://arxiv.org/pdf/2406.12345')
  .then((summary) => console.log(summary));

Token Efficiency Gains

Scenario                          Input tokens   Output tokens   Approx. cost*
Raw HTML (no extraction)          50,000         500             ~$1.50
Clean Markdown (with /extract)    2,000          500             ~$0.06

*Cost based on typical LLM pricing; actual rates may vary.
Using clean Markdown reduces token usage by ≈25×, cutting cost and speeding up responses. It also improves LLM comprehension because Markdown is a more natural format for text structure.

Common Use Cases

  • Research aggregator – extract and summarize dozens of papers.
  • Competitive intelligence – pull competitor web pages for analysis.
  • Documentation agent – ingest API docs and answer developer questions.
  • News digest – collect daily articles and generate briefings.
  • Content curator – fetch blog posts and categorize them.
  • Customer support – pull help‑center articles to train a support bot.
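Several of these use cases (research aggregator, news digest) involve extracting many URLs at once. A minimal concurrent-batch sketch, reusing the request shape from the earlier examples; failed extractions resolve to null so one bad URL does not sink the whole batch (`fetchImpl` is again an injection point for testing, not part of the API):

```javascript
// Extract several URLs concurrently.
async function extractMany(urls, apiKey, fetchImpl = fetch) {
  return Promise.all(urls.map(async (url) => {
    try {
      const res = await fetchImpl('https://api.pagebolt.com/v1/extract', {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${apiKey}`,
          'Content-Type': 'application/json'
        },
        body: JSON.stringify({ url })
      });
      // Map failures to null instead of rejecting the whole batch.
      return res.ok ? await res.json() : null;
    } catch {
      return null;
    }
  }));
}
```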

Sample API Response

{
  "markdown": "# Article Title\n\nArticle content...",
  "title": "Article Title",
  "author": "Author Name",
  "published_date": "2026-03-18",
  "word_count": 1200,
  "estimated_reading_time_minutes": 5
}

The JSON includes the Markdown body plus useful metadata that can be displayed or stored.
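As a quick illustration of consuming that metadata, a small formatter built from the field names in the sample response above:

```javascript
// Build a one-line byline from the /extract response metadata.
function formatByline({ title, author, published_date, estimated_reading_time_minutes }) {
  return `${title} by ${author} (${published_date}, ~${estimated_reading_time_minutes} min read)`;
}

// Sample response fields from the example above.
const sample = {
  title: 'Article Title',
  author: 'Author Name',
  published_date: '2026-03-18',
  estimated_reading_time_minutes: 5
};
console.log(formatByline(sample));
// => "Article Title by Author Name (2026-03-18, ~5 min read)"
```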

Pricing

Plan      Monthly price   Included extractions
Starter   $29             500
Growth    $79             5,000
Scale     $199            50,000

Getting Started

  1. Obtain an API key from pagebolt.dev/pricing.
  2. Make a POST request to /extract with the target URL (and optional parameters).
  3. Consume the returned Markdown in your AI agent—no HTML parsing required.

Start for free with 100 extractions per month (no credit card needed).

Your AI agent now has a direct pipeline from URLs to clean, LLM‑friendly content. No noise, just the data it needs.
