Building a Perplexity Clone for Local LLMs in 50 Lines of Python

Published: March 20, 2026 at 01:42 AM EDT
7 min read
Source: Dev.to

Your local LLM is smart but blind: it can’t see the internet. Here’s how to give it eyes, a filter, and a citation engine.

This is a hands-on tutorial. We’ll install a library, run a real query, break down every stage of what happens inside, and look at the actual output your LLM receives. By the end, you’ll have a working pipeline that turns any local model (Ollama, LM Studio, anything with a text input) into something that searches the web, reads pages, ranks the results, and generates a structured prompt with inline citations, like a self-hosted Perplexity.

Background: if you want to understand the architecture this is based on, I wrote a deep dive into how Perplexity actually works: the five-stage RAG pipeline, hybrid retrieval on Vespa.ai, Cerebras-accelerated inference, the citation integrity problems. This tutorial is the practical counterpart.

Repo: github.com/KazKozDev/production_rag_pipeline

A pipeline that does this:

```
Your question
  ↓ Search (Bing + DuckDuckGo, parallel)
  ↓ Semantic pre-filter (drop irrelevant results before fetching)
  ↓ Fetch pages (only the ones that passed filtering)
  ↓ Extract content (strip boilerplate, ads, navigation)
  ↓ Chunk + Rerank (BM25 + semantic + answer-span + MMR)
  ↓ LLM-ready prompt with numbered citations
```

The pipeline does NOT include the LLM itself; it builds the prompt. You plug in whatever model you want.

```bash
git clone https://github.com/KazKozDev/production_rag_pipeline.git
cd production_rag_pipeline
```

Pick your install level:

```bash
# Minimal: BM25 ranking, BeautifulSoup extraction. No ML models.
pip install .

# Better extraction with trafilatura
pip install .[extraction]

# Semantic ranking with sentence-transformers (recommended)
pip install .[semantic]

# Everything
pip install .[full]
```

For this tutorial, use `.[full]`. The first run downloads embedding models (~100–500 MB depending on language); this only happens once. No API keys are needed: Bing and DuckDuckGo are queried without authentication.

```python
from production_rag_pipeline import build_llm_prompt

prompt = build_llm_prompt("latest AI news", lang="en")
print(prompt)
```

That’s the entire interface. `build_llm_prompt` runs the full pipeline (search, filter, fetch, extract, rerank) and returns a formatted string ready to paste into any LLM. It also ships as a CLI:

```bash
production-rag-pipeline "latest AI news"
```

Or with options:

```bash
# Search-only mode (no page fetching)
production-rag-pipeline "Bitcoin price" --mode search

# Russian query
production-rag-pipeline "новости ИИ" --mode read --lang ru
```

There is also a launcher script:

```bash
./run_llm_query.command
```

This bootstraps a virtual environment automatically on first run.

Let’s trace what the pipeline actually does with “latest AI news”. Enable debug mode to see it:

```python
from production_rag_pipeline.pipeline import search_extract_rerank

chunks, results, fetched_urls = search_extract_rerank(
    query="latest AI news",
    num_fetch=8,
    lang="en",
    debug=True,
)
```

Bing and DuckDuckGo are searched in parallel. Results are merged with position-based scoring: the first result from each engine scores highest, and results that appear in both engines get a boost. The pipeline detects keywords like “news”, “latest”, “breaking” and switches DDG to its News index, returning actual articles instead of generic homepages.

The semantic pre-filter is the key optimization. Before fetching any page, the pipeline computes cosine similarity between the query embedding and each result’s title+snippet embedding. Results below the threshold get dropped:

- English: threshold 0.30
- Russian: threshold 0.25

In practice, ~11 out of 20 results get filtered, saving about 6 seconds of HTTP fetches. Example from a real run with “LLM agents news”:

```
✗ flutrackers.com      sim=0.12 → filtered (irrelevant)
✓ llm-stats.com        sim=0.68 → fetched
✗ reddit.com/r/gaming  sim=0.15 → filtered
✓ arxiv.org/abs/2503   sim=0.71 → fetched
```
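To make the threshold logic concrete, here is a minimal sketch. It is not the library’s code: `prefilter_results` and the precomputed `vec` field are illustrative, and in the real pipeline the embeddings come from sentence-transformers rather than being supplied by the caller.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def prefilter_results(query_vec, results, threshold=0.30):
    """Keep only results whose title+snippet embedding clears the
    similarity threshold, so irrelevant pages are never fetched."""
    kept = []
    for r in results:
        sim = cosine(query_vec, r["vec"])
        if sim >= threshold:
            kept.append({**r, "sim": round(sim, 2)})
    return kept
```

Because the filter runs on search snippets, not full pages, the expensive HTTP fetch is skipped entirely for anything that fails it.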

No hardcoded domain lists. Pure semantic relevance.

Surviving results (typically 5–9 URLs) are fetched in parallel. Content extraction runs a two-stage quality check:

1. Structural check: does >30% of lines look like numbers/prices/tables?
2. Semantic check: if flagged, is the table relevant to the query?

This is how exchange rate tables from cbr.ru pass for a currency query (similarity 0.75) but CS:GO price lists get rejected (similarity 0.05). After extraction, boilerplate is stripped: navigation, ads, newsletter signup patterns, cookie banners.

Extracted content is chunked, then reranked by four signals:

- BM25: classic lexical term-frequency matching
- Semantic similarity: cosine between query and chunk embeddings
- Answer-span detection: does this chunk directly answer the question?
- MMR diversity: prevents top results from all being paraphrases of the same paragraph

Optional: a cross-encoder runs on the final shortlist for maximum accuracy (slower but better). For news queries, freshness penalties apply:

- Content >7 days old: −1 confidence
- Content >30 days old: −2 confidence
- Outdated sources are flagged in the prompt with their exact age

The pipeline then builds a structured prompt:

```python
from production_rag_pipeline.pipeline import build_llm_context
from production_rag_pipeline.prompts import build_llm_prompt

context, source_mapping, grouped_sources = build_llm_context(
    chunks,
    results,
    fetched_urls=fetched_urls,
    renumber_sources=True,  # ← fixes phantom citation numbers
)
```
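Of the four reranking signals, MMR is the least familiar, so here is a self-contained sketch of the idea. This is my own approximation, not the library’s implementation; the function name and the λ=0.5 default are assumptions.

```python
import numpy as np

def mmr_select(query_vec, chunk_vecs, k=3, lam=0.5):
    """Maximal Marginal Relevance: greedily pick chunks that are
    relevant to the query but dissimilar to chunks already picked,
    so the top results aren't paraphrases of the same paragraph."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    candidates = list(range(len(chunk_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best_i, best_score = None, float("-inf")
        for i in candidates:
            relevance = cos(query_vec, chunk_vecs[i])
            # Penalty: similarity to the closest already-selected chunk
            redundancy = max(
                (cos(chunk_vecs[i], chunk_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
        candidates.remove(best_i)
    return selected  # indices into chunk_vecs, in pick order
```

With λ=0.5, an exact duplicate of an already-selected chunk scores at or below zero, so a moderately relevant but novel chunk wins the next slot instead.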

Citation numbers are renumbered after every filtering step. If three sources survive, they’re numbered [1], [2], [3], never [1], [3], [7] with phantom gaps. The current date and time are injected into the prompt so the LLM can reason about source freshness.

The final prompt looks roughly like this (abbreviated):

```
Current date: 2026-03-20

Answer the user's question using ONLY the provided sources.
Cite sources using [1], [2], etc. Do not make claims without a citation.

=== SOURCES ===

[1] OpenAI announces GPT-5 turbo with 1M context window
Source: techcrunch.com | Published: 2026-03-19
OpenAI today released GPT-5 Turbo, featuring a 1 million token context
window and improved reasoning capabilities…

[2] Google DeepMind publishes Gemini 2.5 technical report
Source: blog.google | Published: 2026-03-18
The technical report details architectural changes including
mixture-of-experts scaling to 3.2 trillion parameters…

[3] Anthropic raises $5B Series E at $90B valuation
Source: reuters.com | Published: 2026-03-17
Anthropic closed a $5 billion funding round, bringing its total raised
to over $15 billion…

=== QUESTION ===

latest AI news
```
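The renumbering that `renumber_sources=True` performs can be sketched in a few lines. This is a hypothetical sketch, not the library’s internals; `renumber_citations` and `rewrite_markers` are names I made up for illustration.

```python
import re

def renumber_citations(sources):
    """Re-index surviving sources as [1..n] and return a map from
    old citation numbers to new ones."""
    mapping = {}
    renumbered = []
    for new_id, src in enumerate(sources, start=1):
        mapping[src["id"]] = new_id
        renumbered.append({**src, "id": new_id})
    return renumbered, mapping

def rewrite_markers(text, mapping):
    """Rewrite [n] markers in the context text to the new numbering."""
    return re.sub(
        r"\[(\d+)\]",
        lambda m: f"[{mapping[int(m.group(1))]}]",
        text,
    )
```

The point is that both the source list and every inline marker are rewritten from the same mapping, so the LLM never sees a citation number with no matching source.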

Drop this into Ollama, LM Studio, or any API. The model sees curated, relevant, cited content, not raw web pages.

```python
from production_rag_pipeline import RAGConfig, build_llm_prompt

config = RAGConfig(
    num_per_engine=12,        # results per search engine
    top_n_fetch=8,            # max pages to fetch
    fetch_timeout=10,         # seconds per page
    total_context_chunks=12,  # chunks in final prompt
)

prompt = build_llm_prompt("latest AI news", config=config)
```

Or load options from a YAML file:

```bash
production-rag-pipeline "latest AI news" --config config.example.yaml
```

Or set them as environment variables:

```bash
export RAG_TOP_N_FETCH=8
export RAG_FETCH_TIMEOUT=10
production-rag-pipeline "latest AI news"
```
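The environment-variable overrides amount to a defaults-plus-override merge. A minimal sketch, under assumptions: `config_from_env` and the `DEFAULTS` dict are mine, and it only handles the two integer fields used above, not the full config surface.

```python
import os

# Defaults mirroring two of the RAGConfig fields shown earlier
DEFAULTS = {"top_n_fetch": 8, "fetch_timeout": 10}

def config_from_env(env=None):
    """Start from defaults, then let RAG_* environment variables win."""
    env = os.environ if env is None else env
    cfg = dict(DEFAULTS)
    for key in cfg:
        raw = env.get(f"RAG_{key.upper()}")
        if raw is not None:
            cfg[key] = int(raw)
    return cfg
```

Taking the environment as a parameter (defaulting to `os.environ`) keeps the merge testable without mutating real environment variables.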

Here’s the entire pipeline, from query to LLM-ready prompt, using the module-level API:

```python
from production_rag_pipeline.search import search
from production_rag_pipeline.fetch import fetch_pages_parallel
from production_rag_pipeline.extract import extract_content, chunk_text
from production_rag_pipeline.rerank import rerank_chunks
from production_rag_pipeline.pipeline import build_llm_context
from production_rag_pipeline.prompts import build_llm_prompt

# 1. Search
query = "latest AI news"
results = search(query, num_per_engine=10, lang="en")

# 2. Fetch
urls = [r["url"] for r in results[:8]]
pages = fetch_pages_parallel(urls, timeout=10)

# 3. Extract + Chunk
all_chunks = []
for url, html in pages.items():
    text = extract_content(html, url=url)
    if text:
        chunks = chunk_text(text, url=url)
        all_chunks.extend(chunks)

# 4. Rerank
ranked = rerank_chunks(query, all_chunks, lang="en")

# 5. Build prompt
context, mapping, sources = build_llm_context(
    ranked, results, renumber_sources=True
)
prompt = build_llm_prompt(query, context=context, sources=sources)

print(prompt)
```

This is what `build_llm_prompt("latest AI news")` does internally, broken into visible steps. The pipeline works at every install level:

| Install | Ranking | Extraction | Speed |
| --- | --- | --- | --- |
| `pip install .` | BM25 only | BeautifulSoup | Fastest, least accurate |
| `pip install .[extraction]` | BM25 only | Trafilatura | Better content quality |
| `pip install .[semantic]` | BM25 + semantic + MMR | BeautifulSoup | Much better ranking |
| `pip install .[full]` | BM25 + semantic + cross-encoder + MMR | Trafilatura | Best quality |

No GPU required. Semantic models run on CPU — slower, but functional.

| | Perplexity | production-rag-pipeline |
| --- | --- | --- |
| Index | 200B+ pre-indexed URLs | Real-time Bing + DDG |
| Latency | 358 ms median | 8–15 s on a MacBook |
| Models | 20+ with dynamic routing | You choose (Ollama, LM Studio, etc.) |
| Inference | Cerebras CS-3, 1,200 tok/s | Your hardware |
| Cost | $20/mo Pro | Free |
| Privacy | Cloud | Local |
| Code | Closed | Open source, MIT |

The gap is real, especially on latency and index size. But for a tool that runs on your laptop, feeds any local model, and costs nothing, the tradeoff is worth it.

The pipeline auto-detects language by Cyrillic character ratio (10% threshold):

- English → all-MiniLM-L6-v2 (fast, English-optimized)
- Russian → paraphrase-multilingual-MiniLM-L12-v2 (13 languages)

Cross-encoder reranking also switches models per language. No manual configuration needed.

```bash
production-rag-pipeline "новости ИИ" --lang ru
```
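The Cyrillic-ratio heuristic is simple enough to sketch in a few lines. This is my own approximation; the function name and the exact character range are assumptions, but the 10% threshold is the one the pipeline documents.

```python
def detect_lang(text, threshold=0.10):
    """Guess 'ru' vs 'en' by the share of Cyrillic characters among
    the alphabetic characters of the query (10% threshold)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return "en"
    cyrillic = sum(1 for c in letters if "\u0400" <= c <= "\u04FF")
    return "ru" if cyrillic / len(letters) >= threshold else "en"
```

The low threshold means a mostly-English query with a few Russian words still routes to the multilingual model, which is the safer default for mixed input.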

This is Part 2 of a series:

- Part 1: How Perplexity Actually Searches the Internet (architecture teardown)
- Part 2: You’re reading it (build the local equivalent)

Star the repo if this is useful: github.com/KazKozDev/production_rag_pipeline. Issues and contributions welcome.
