Building a Perplexity Clone for Local LLMs in 50 Lines of Python
Source: Dev.to
Your local LLM is smart but blind: it can't see the internet. Here's how to give it eyes, a filter, and a citation engine.

This is a hands-on tutorial. We'll install a library, run a real query, break down every stage of what happens inside, and look at the actual output your LLM receives. By the end, you'll have a working pipeline that turns any local model (Ollama, LM Studio, anything with a text input) into something that searches the web, reads pages, ranks the results, and generates a structured prompt with inline citations, like a self-hosted Perplexity.

Background: if you want to understand the architecture this is based on, I wrote a deep dive into how Perplexity actually works: the five-stage RAG pipeline, hybrid retrieval on Vespa.ai, Cerebras-accelerated inference, the citation integrity problems. This tutorial is the practical counterpart.

Repo: github.com/KazKozDev/production_rag_pipeline

A pipeline that does this:

```
Your question
  ↓
Search (Bing + DuckDuckGo, parallel)
  ↓
Semantic pre-filter (drop irrelevant results before fetching)
  ↓
Fetch pages (only the ones that passed filtering)
  ↓
Extract content (strip boilerplate, ads, navigation)
  ↓
Chunk + Rerank (BM25 + semantic + answer-span + MMR)
  ↓
LLM-ready prompt with numbered citations
```
The pipeline does NOT include the LLM itself: it builds the prompt. You plug in whatever model you want.

```shell
git clone https://github.com/KazKozDev/production_rag_pipeline.git
cd production_rag_pipeline
```
Pick your install level:

```shell
# Minimal: BM25 ranking, BeautifulSoup extraction. No ML models.
pip install .

# Better extraction with trafilatura
pip install .[extraction]

# Semantic ranking with sentence-transformers (recommended)
pip install .[semantic]

# Everything
pip install .[full]
```
For this tutorial, use .[full]. The first run will download embedding models (~100–500 MB depending on language); this only happens once. No API keys are needed: Bing and DuckDuckGo are queried without authentication.

```python
from production_rag_pipeline import build_llm_prompt

prompt = build_llm_prompt("latest AI news", lang="en")
print(prompt)
```
That's the entire interface. build_llm_prompt runs the full pipeline (search, filter, fetch, extract, rerank) and returns a formatted string ready to paste into any LLM.

There is also a CLI:

```shell
production-rag-pipeline "latest AI news"
```
Or with options:

```shell
# Search-only mode (no page fetching)
production-rag-pipeline "Bitcoin price" --mode search

# Russian query
production-rag-pipeline "новости ИИ" --mode read --lang ru
```
```shell
./run_llm_query.command
```
This bootstraps a virtual environment automatically on first run.

Let's trace what the pipeline actually does with "latest AI news". Enable debug mode to see it:

```python
from production_rag_pipeline.pipeline import search_extract_rerank

chunks, results, fetched_urls = search_extract_rerank(
    query="latest AI news",
    num_fetch=8,
    lang="en",
    debug=True,
)
```
Bing and DuckDuckGo are searched in parallel. Results are merged with position-based scoring: the first result from each engine scores highest, and results that appear in both engines get a boost. The pipeline detects keywords like "news", "latest", "breaking" and switches DDG to its News index, returning actual articles instead of generic homepages.

The semantic pre-filter is the key optimization. Before fetching any page, the pipeline computes cosine similarity between the query embedding and each result's title+snippet embedding. Results below a threshold get dropped:

- English: threshold 0.30
- Russian: threshold 0.25

In practice, ~11 out of 20 results get filtered, saving about 6 seconds of HTTP fetches. Example from a real run with "LLM agents news":

```
✗ flutrackers.com       sim=0.12 → filtered (irrelevant)
✓ llm-stats.com         sim=0.68 → fetched
✗ reddit.com/r/gaming   sim=0.15 → filtered
✓ arxiv.org/abs/2503    sim=0.71 → fetched
```
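To make the pre-filter concrete, here is a minimal sketch of threshold-based cosine filtering. In the real pipeline the embeddings come from a sentence-transformers model; here they are toy vectors, and `prefilter` is a hypothetical name, not the library's API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def prefilter(query_vec, results, threshold=0.30):
    """Keep only results whose title+snippet embedding clears the threshold."""
    return [r for r in results if cosine(query_vec, r["embedding"]) >= threshold]

# Toy embeddings standing in for real model output
query_vec = [1.0, 0.0, 0.2]
results = [
    {"url": "llm-stats.com", "embedding": [0.9, 0.1, 0.3]},    # on-topic
    {"url": "flutrackers.com", "embedding": [0.0, 1.0, 0.0]},  # unrelated
]
kept = prefilter(query_vec, results)
```

The point is that the filter runs on cheap snippet embeddings, before any expensive HTTP fetch.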
No hardcoded domain lists. Pure semantic relevance.

Surviving results (typically 5–9 URLs) are fetched in parallel. Content extraction runs a two-stage quality check:

1. Structural check: do more than 30% of the lines look like numbers/prices/tables?
2. Semantic check: if flagged, is the table relevant to the query?

This is how exchange rate tables from cbr.ru pass for a currency query (similarity 0.75) but CS:GO price lists get rejected (similarity 0.05). After extraction, boilerplate is stripped: navigation, ads, newsletter signup patterns, cookie banners.

Extracted content is chunked, then reranked by four signals:

- BM25: classic lexical term-frequency matching
- Semantic similarity: cosine between query and chunk embeddings
- Answer-span detection: does this chunk directly answer the question?
- MMR diversity: prevents top results from all being paraphrases of the same paragraph

Optional: a cross-encoder runs on the final shortlist for maximum accuracy (slower but better).

For news queries, freshness penalties apply:

- Content >7 days old: −1 confidence
- Content >30 days old: −2 confidence
- Outdated sources are flagged in the prompt with their exact age

The pipeline builds a structured prompt:

```python
from production_rag_pipeline.pipeline import build_llm_context
from production_rag_pipeline.prompts import build_llm_prompt

context, source_mapping, grouped_sources = build_llm_context(
    chunks,
    results,
    fetched_urls=fetched_urls,
    renumber_sources=True,  # ← fixes phantom citation numbers
)
```
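A minimal sketch of what such renumbering does, assuming citations appear as `[n]` markers in the context text (a hypothetical helper, not the library's implementation; it assumes every `[n]` in the text refers to a surviving source):

```python
import re

def renumber_citations(text, surviving_ids):
    """Map surviving source ids to a dense 1..N sequence and rewrite [n] markers.
    E.g. if sources 1, 3, 7 survive, [1] stays [1], [3] becomes [2], [7] becomes [3]."""
    mapping = {old: new for new, old in enumerate(sorted(surviving_ids), start=1)}
    return re.sub(r"\[(\d+)\]", lambda m: f"[{mapping[int(m.group(1))]}]", text)
```

This is why the LLM never sees gaps like [1], [3], [7]: whatever survives filtering is always presented as a contiguous sequence.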
Citation numbers are renumbered after every filtering step. If three sources survive, they're numbered [1], [2], [3], never [1], [3], [7] with phantom gaps. The current date and time are injected into the prompt so the LLM can reason about source freshness.

The final prompt looks roughly like this (abbreviated):

```
Current date: 2026-03-20

Answer the user's question using ONLY the provided sources.
Cite sources using [1], [2], etc. Do not make claims without a citation.

=== SOURCES ===

[1] OpenAI announces GPT-5 turbo with 1M context window
Source: techcrunch.com | Published: 2026-03-19
OpenAI today released GPT-5 Turbo, featuring a 1 million token context window and improved reasoning capabilities…

[2] Google DeepMind publishes Gemini 2.5 technical report
Source: blog.google | Published: 2026-03-18
The technical report details architectural changes including mixture-of-experts scaling to 3.2 trillion parameters…

[3] Anthropic raises $5B Series E at $90B valuation
Source: reuters.com | Published: 2026-03-17
Anthropic closed a $5 billion funding round, bringing its total raised to over $15 billion…

=== QUESTION ===

latest AI news
```
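The prompt above can be handed to any local model. As one hedged example, it can be POSTed to a locally running Ollama server via its `/api/generate` endpoint (the model name and default port are assumptions; adjust for your setup):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(prompt, model="llama3"):
    """Build the JSON body Ollama's /api/generate expects."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask_ollama(prompt, model="llama3"):
    """POST the pipeline's prompt to a local Ollama server, return the text."""
    data = json.dumps(build_payload(prompt, model)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With `stream: False`, Ollama returns one JSON object whose `response` field holds the full answer.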
Drop this into Ollama, LM Studio, or any API. The model sees curated, relevant, cited content, not raw web pages.

Tuning is done through RAGConfig:

```python
from production_rag_pipeline import RAGConfig, build_llm_prompt

config = RAGConfig(
    num_per_engine=12,        # results per search engine
    top_n_fetch=8,            # max pages to fetch
    fetch_timeout=10,         # seconds per page
    total_context_chunks=12,  # chunks in final prompt
)

prompt = build_llm_prompt("latest AI news", config=config)
```
The same options can also be set via a YAML file or environment variables:

```shell
# YAML config
production-rag-pipeline "latest AI news" --config config.example.yaml

# Environment variables
export RAG_TOP_N_FETCH=8
export RAG_FETCH_TIMEOUT=10
production-rag-pipeline "latest AI news"
```
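One plausible way such environment overrides could be applied, sketched under the assumption that each config field maps to a `RAG_`-prefixed variable (the library's actual loader may differ):

```python
import os

def apply_env_overrides(defaults):
    """Overlay RAG_* environment variables onto a dict of config defaults,
    casting each value to the type of its default."""
    out = dict(defaults)
    for key, default in defaults.items():
        env_value = os.environ.get("RAG_" + key.upper())
        if env_value is not None:
            out[key] = type(default)(env_value)
    return out
```

Typing the override off the default keeps `RAG_FETCH_TIMEOUT=5` an int rather than the string `"5"`.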
Here's the entire pipeline, from query to LLM-ready prompt, using the module-level API:

```python
from production_rag_pipeline.search import search
from production_rag_pipeline.fetch import fetch_pages_parallel
from production_rag_pipeline.extract import extract_content, chunk_text
from production_rag_pipeline.rerank import rerank_chunks
from production_rag_pipeline.pipeline import build_llm_context
from production_rag_pipeline.prompts import build_llm_prompt

# 1. Search
query = "latest AI news"
results = search(query, num_per_engine=10, lang="en")

# 2. Fetch
urls = [r["url"] for r in results[:8]]
pages = fetch_pages_parallel(urls, timeout=10)

# 3. Extract + Chunk
all_chunks = []
for url, html in pages.items():
    text = extract_content(html, url=url)
    if text:
        chunks = chunk_text(text, url=url)
        all_chunks.extend(chunks)

# 4. Rerank
ranked = rerank_chunks(query, all_chunks, lang="en")

# 5. Build prompt
context, mapping, sources = build_llm_context(
    ranked, results, renumber_sources=True
)
prompt = build_llm_prompt(query, context=context, sources=sources)

print(prompt)
```
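Of the rerank signals in step 4, MMR diversity is the least standard. Here is a simplified greedy MMR over precomputed similarity scores (toy numbers and a hypothetical function name, not the library's code):

```python
def mmr_select(query_sims, pairwise_sims, k=2, lambda_=0.7):
    """Greedy Maximal Marginal Relevance: balance relevance to the query
    against similarity to chunks already selected.

    query_sims:    list of query-chunk similarities
    pairwise_sims: pairwise_sims[i][j] = similarity between chunks i and j
    """
    selected = []
    candidates = list(range(len(query_sims)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((pairwise_sims[i][j] for j in selected), default=0.0)
            return lambda_ * query_sims[i] - (1 - lambda_) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Chunks 0 and 1 are near-duplicates; MMR picks 0, then skips 1 in favor of 2.
query_sims = [0.9, 0.88, 0.6]
pairwise = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
picked = mmr_select(query_sims, pairwise, k=2)
```

This is what prevents the top of the context from being three paraphrases of the same paragraph.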
This is what build_llm_prompt("latest AI news") does internally, broken into visible steps.

The pipeline works at every install level:
| Install | Ranking | Extraction | Speed |
|---|---|---|---|
| `pip install .` | BM25 only | BeautifulSoup | Fastest, least accurate |
| `pip install .[extraction]` | BM25 only | Trafilatura | Better content quality |
| `pip install .[semantic]` | BM25 + semantic + MMR | BeautifulSoup | Much better ranking |
| `pip install .[full]` | BM25 + semantic + cross-encoder + MMR | Trafilatura | Best quality |
No GPU required. Semantic models run on CPU — slower, but functional.
| | Perplexity | production-rag-pipeline |
|---|---|---|
| Index | 200B+ pre-indexed URLs | Real-time Bing + DDG |
| Latency | 358 ms median | 8–15 s on a MacBook |
| Models | 20+ with dynamic routing | You choose (Ollama, LM Studio, etc.) |
| Inference | Cerebras CS-3, 1,200 tok/s | Your hardware |
| Cost | $20/mo Pro | Free |
| Privacy | Cloud | Local |
| Code | Closed | Open source, MIT |
The gap is real, especially on latency and index size. But for a tool that runs on your laptop, feeds any local model, and costs nothing, the tradeoff is worth it.

The pipeline auto-detects language by Cyrillic character ratio (10% threshold):

- English → all-MiniLM-L6-v2 (fast, English-optimized)
- Russian → paraphrase-multilingual-MiniLM-L12-v2 (13 languages)

Cross-encoder reranking also switches models per language. No manual configuration is needed.

```shell
production-rag-pipeline "новости ИИ" --lang ru
```
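The Cyrillic-ratio heuristic is simple enough to sketch directly; this is my reconstruction of the described 10% rule, not the library's exact code:

```python
def detect_lang(text, threshold=0.10):
    """Return 'ru' if at least 10% of alphabetic characters are Cyrillic."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return "en"
    cyrillic = sum(1 for c in letters if "\u0400" <= c <= "\u04ff")
    return "ru" if cyrillic / len(letters) >= threshold else "en"
```

The low threshold means mixed queries like "GPT news про агентов" still route to the Russian models.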
This is Part 2 of a series:

- Part 1: How Perplexity Actually Searches the Internet (architecture teardown)
- Part 2: You're reading it (build the local equivalent)

Star the repo if this is useful: github.com/KazKozDev/production_rag_pipeline. Issues and contributions welcome.