Inside a Scholarly Search Engine: Indexing, Ranking, and Retrieval
Source: Dev.to
What Are We Actually Building?
To build a functional scholarly search engine, we moved beyond simple database queries and implemented an inverted index, the core data structure behind most modern search engines.
The Stack
- Core: Python (logic and data structures)
- Web Framework: Flask (API and UI)
- Frontend: HTML, CSS & vanilla JavaScript (lightweight, monolithic)
The Secret Sauce
A custom‑built Inverted Index paired with a BM25 ranking algorithm.
The “Aha!” Moment: Why Simple Counts Do Not Work
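Ranking purely by term frequency favors documents that are merely wordy: a long document repeats query terms more often simply because it contains more words, not because it is more relevant.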
BM25 addresses this by:
- Penalizing common words (Inverse Document Frequency).
- Normalizing for document length (Length Normalization).
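To make those two effects concrete, here is a small sketch that isolates each piece. The function names `idf` and `tf_component`, the collection size, and the document lengths are all made up for illustration; the IDF shown is the common smoothed form used when no relevance judgments are available.

```python
import math

def idf(n, N):
    """Inverse document frequency with the usual +0.5 smoothing."""
    return math.log((N - n + 0.5) / (n + 0.5))

def tf_component(f, dl, avdl, k1=1.5, b=0.75):
    """Saturating, length-normalized term-frequency factor."""
    K = k1 * ((1 - b) + b * (dl / avdl))
    return ((k1 + 1) * f) / (K + f)

N = 1000     # hypothetical collection size
avdl = 200   # hypothetical average document length

# A rare term (in 10 documents) outweighs a common one (in 400 documents).
rare, common = idf(10, N), idf(400, N)

# Same term frequency, different lengths: the shorter document scores higher.
short_doc = tf_component(f=5, dl=100, avdl=avdl)
long_doc = tf_component(f=5, dl=800, avdl=avdl)

# Doubling the raw count does not double the score (saturation).
f5 = tf_component(f=5, dl=200, avdl=avdl)
f10 = tf_component(f=10, dl=200, avdl=avdl)

print(f"idf: rare={rare:.2f}, common={common:.2f}")
print(f"tf: short={short_doc:.2f}, long={long_doc:.2f}")
print(f"tf: f=5 -> {f5:.2f}, f=10 -> {f10:.2f}")
```

The saturation in the last comparison is what keeps a document from winning just by repeating a keyword.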
BM25 Scoring Function (Python)
import math

def score_bm25(n, f, qf, r, N, dl, avdl, k1=1.5, b=0.75, k2=100, R=0):
    """
    n    – number of documents containing the term
    f    – term frequency in the document
    qf   – term frequency in the query
    r    – number of relevant documents containing the term
    N    – total number of documents
    dl   – document length
    avdl – average document length
    R    – total number of relevant documents (0 without relevance judgments)
    """
    # Scaling factor based on document length
    K = k1 * ((1 - b) + b * (dl / avdl))
    # Relevance component; with r = R = 0 it reduces to
    # log((N - n + 0.5) / (n + 0.5))
    first = math.log(((r + 0.5) / (R - r + 0.5)) /
                     ((n - r + 0.5) / (N - n - R + r + 0.5)))
    # Document term-frequency component (saturates as f grows)
    second = ((k1 + 1) * f) / (K + f)
    # Query term-frequency component
    third = ((k2 + 1) * qf) / (k2 + qf)
    return first * second * third
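For reference, the per-term weight that a function like this is modeled on is usually written as below. This is the textbook Robertson/Sparck Jones form of BM25, stated from the standard literature rather than from the project's source. The symbols match the docstring; R stands for the total number of documents judged relevant.

```latex
w(t, d, q) =
  \log \frac{(r + 0.5)\,/\,(R - r + 0.5)}
            {(n - r + 0.5)\,/\,(N - n - R + r + 0.5)}
  \cdot \frac{(k_1 + 1)\, f}{K + f}
  \cdot \frac{(k_2 + 1)\, qf}{k_2 + qf},
\qquad
K = k_1 \left( (1 - b) + b \, \frac{dl}{avdl} \right)
```

When no relevance judgments exist (r = R = 0), the logarithm reduces to log((N - n + 0.5) / (n + 0.5)), which is the familiar IDF term. A document's score for a query is the sum of this weight over the query's terms.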
Indexing: The Heavy Lifting
The Inverted Index works like a textbook index: each term maps to a list of document IDs where it appears.
Simplified Indexing Process (Python)
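from collections import defaultdict

# corpus: dict mapping doc_id -> raw text; preprocess(): returns a list of terms
# (both are assumed to be defined elsewhere in the project)
inverted_index = defaultdict(list)
for doc_id, text in corpus.items():
    tokens = preprocess(text)  # strip punctuation, lowercase, etc.
    for term in tokens:
        # A doc_id is appended once per occurrence, so repeats reflect term frequency
        inverted_index[term].append(doc_id)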
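Once the index exists, answering a query means looking up each query term's postings and accumulating a score per document. The sketch below is one way that could look, not the project's actual code: the toy corpus, the `tokenize` stand-in for `preprocess`, the `postings` layout (term to `{doc_id: frequency}`), and the `search` function are all assumptions made for illustration, and the scoring uses BM25 without relevance feedback.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Hypothetical stand-in for preprocess(): lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Toy corpus, invented purely for illustration.
corpus = {
    1: "Neural ranking models for scholarly search",
    2: "Scholarly search with inverted index structures and fast index lookups",
    3: "A survey of graph databases",
    4: "Protein folding with deep learning",
    5: "Compilers and type systems",
}

# term -> {doc_id: term frequency in that document}
postings = defaultdict(dict)
doc_len = {}
for doc_id, text in corpus.items():
    tokens = tokenize(text)
    doc_len[doc_id] = len(tokens)
    for term, freq in Counter(tokens).items():
        postings[term][doc_id] = freq

N = len(corpus)
avdl = sum(doc_len.values()) / N

def search(query, k1=1.5, b=0.75):
    """Return (doc_id, score) pairs, best first."""
    scores = defaultdict(float)
    for term in tokenize(query):
        docs = postings.get(term, {})
        n = len(docs)
        if n == 0:
            continue  # term not in the index
        idf = math.log((N - n + 0.5) / (n + 0.5))
        for doc_id, f in docs.items():
            K = k1 * ((1 - b) + b * doc_len[doc_id] / avdl)
            scores[doc_id] += idf * ((k1 + 1) * f) / (K + f)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

results = search("inverted index")
print(results)
```

Only documents that contain at least one query term are ever touched, which is the point of the inverted index. One caveat with this IDF form: it turns negative for terms that occur in more than half of the documents, so real systems often clamp it or drop such terms as stop words.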
Trade‑off:
- Pro: In‑memory index → sub‑millisecond lookups.
- Con: High RAM usage; acceptable for
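// Fetch the search results and render them into the #results element
fetch('response.json')
  .then(res => {
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    return res.json();
  })
  .then(data => {
    const resultsDiv = document.getElementById('results');
    resultsDiv.innerHTML = ''; // Clear old results
    data.forEach(paper => {
      // Use textContent so titles and abstracts are never parsed as HTML
      const title = document.createElement('h3');
      title.textContent = paper.title;
      const abstract = document.createElement('p');
      abstract.textContent = paper.abstract;
      resultsDiv.append(title, abstract);
    });
  })
  .catch(err => console.error('Search error:', err));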
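The frontend code above iterates over a JSON array whose items carry `title` and `abstract` fields. Here is a minimal sketch of how the Python side might produce that shape from a ranked list of document IDs, using only the standard library. The `to_response` function and the `papers` metadata store are hypothetical; the real application presumably does this inside a Flask route.

```python
import json

def to_response(ranked_ids, papers):
    """Serialize ranked document IDs into the array the frontend iterates over."""
    payload = [
        {"title": papers[doc_id]["title"], "abstract": papers[doc_id]["abstract"]}
        for doc_id in ranked_ids
        if doc_id in papers  # skip IDs with no stored metadata
    ]
    return json.dumps(payload)

# Hypothetical metadata store and ranking output.
papers = {
    7: {"title": "Paper A", "abstract": "Abstract A", "year": 2020},
    9: {"title": "Paper B", "abstract": "Abstract B", "year": 2021},
}
body = to_response([9, 7, 42], papers)
print(body)
```

Sending only the fields the page renders keeps the payload small and preserves the ranking order, since JSON arrays are ordered.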
Try It Yourself
The full source code is open source. Clone the repository, follow the instructions in the README, and run the application locally to explore the indexing and ranking pipeline.