Inside a Scholarly Search Engine: Indexing, Ranking, and Retrieval

Published: January 11, 2026 at 02:35 AM EST

Source: Dev.to

What Are We Actually Building?

To build a functional scholarly search engine, we moved beyond simple database queries and implemented an Inverted Index, the core data structure behind most modern search engines.

The Stack

Core: Python (logic and data structures)
Web Framework: Flask (API and UI)
Frontend: HTML, CSS & vanilla JavaScript (lightweight, monolithic)

The Secret Sauce

A custom‑built Inverted Index paired with a BM25 ranking algorithm.

The “Aha!” Moment: Why Simple Counts Do Not Work

Ranking purely by term frequency favors documents that are merely wordy.
BM25 addresses this by:

Penalizing common words (Inverse Document Frequency).
Normalizing for document length (Length Normalization).
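To make the two corrections concrete, here is a minimal sketch that computes each one in isolation. The helper names (idf, length_factor, saturated_tf) and the corpus numbers are illustrative assumptions, not part of the project's code; the formulas are the standard BM25 components with the usual defaults k1=1.5 and b=0.75.

```python
import math

def idf(n, N):
    """Inverse document frequency: rarer terms get a larger weight."""
    return math.log((N - n + 0.5) / (n + 0.5))

def length_factor(dl, avdl, k1=1.5, b=0.75):
    """BM25 length scaling K: grows as a document gets longer than average."""
    return k1 * ((1 - b) + b * (dl / avdl))

def saturated_tf(f, dl, avdl, k1=1.5, b=0.75):
    """Term-frequency component after length normalization."""
    K = length_factor(dl, avdl, k1, b)
    return ((k1 + 1) * f) / (K + f)

N = 1000                  # hypothetical corpus size
rare = idf(n=10, N=N)     # term found in 10 of 1000 documents
common = idf(n=900, N=N)  # term found in 900 of 1000 documents

# Same raw count (f=3), but one document is four times the average length
short_doc = saturated_tf(f=3, dl=100, avdl=200)
long_doc = saturated_tf(f=3, dl=800, avdl=200)

print(f"idf rare={rare:.2f}, common={common:.2f}")
print(f"tf short={short_doc:.2f}, long={long_doc:.2f}")
```

The rare term receives the larger weight, and the same three occurrences count for less in the long document. Note that this IDF variant goes negative for terms that appear in more than half of the documents.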

BM25 Scoring Function (Python)

import math

def score_bm25(n, f, qf, r, N, dl, avdl, k1=1.5, b=0.75, k2=100, R=0):
    """
    Score one query term against one document.

    n    – number of documents containing the term
    f    – term frequency in the document
    qf   – term frequency in the query
    r    – number of relevant documents containing the term
    N    – total number of documents
    dl   – document length
    avdl – average document length
    R    – total number of relevant documents (0 when no relevance judgments exist)
    """
    # Scaling factor based on document length
    K = k1 * ((1 - b) + b * (dl / avdl))

    # Relevance-weighted IDF; with r = R = 0 it reduces to
    # log((N - n + 0.5) / (n + 0.5))
    first = math.log(((r + 0.5) / (R - r + 0.5)) /
                     ((n - r + 0.5) / (N - n - R + r + 0.5)))
    # Document term-frequency saturation
    second = ((k1 + 1) * f) / (K + f)
    # Query term-frequency saturation
    third = ((k2 + 1) * qf) / (k2 + qf)

    return first * second * third
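A per-term score only becomes a ranking once it is summed over the query terms for every candidate document. The sketch below shows one way that could look. It is a simplified illustration, not the project's code: bm25_term and rank are hypothetical names, no relevance judgments are used, each query term is assumed to occur once (so the query-frequency factor equals 1 and is omitted), and the toy corpus is invented.

```python
import math
from collections import Counter

def bm25_term(n, f, N, dl, avdl, k1=1.5, b=0.75):
    """Per-term BM25 score with no relevance information."""
    K = k1 * ((1 - b) + b * (dl / avdl))
    idf = math.log((N - n + 0.5) / (n + 0.5))
    return idf * ((k1 + 1) * f) / (K + f)

def rank(query_terms, docs):
    """Return (doc_id, score) pairs, best first. docs maps id -> token list."""
    N = len(docs)
    avdl = sum(len(tokens) for tokens in docs.values()) / N
    counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    scores = {}
    for term in query_terms:
        n = sum(1 for c in counts.values() if term in c)
        if n == 0:
            continue  # term absent from the corpus: contributes nothing
        for doc_id, c in counts.items():
            f = c[term]
            if f:
                dl = len(docs[doc_id])
                scores[doc_id] = scores.get(doc_id, 0.0) + bm25_term(n, f, N, dl, avdl)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented toy corpus: "b" mentions the term twice but is very wordy
docs = {
    "a": ["bm25", "ranking", "explained"],
    "b": ["bm25", "bm25"] + ["filler"] * 18,
    "c": ["neural", "retrieval"],
    "d": ["graph", "search"],
    "e": ["citation", "analysis", "methods"],
}
ranked = rank(["bm25"], docs)
print(ranked)
```

Document "a" outranks "b" even though "b" has the higher raw count, which is the behavior that plain term-frequency ranking cannot deliver.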

Indexing: The Heavy Lifting

The Inverted Index works like a textbook index: each term maps to a list of document IDs where it appears.

Simplified Indexing Process (Python)

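from collections import defaultdict

# corpus: dict mapping doc_id -> raw text; preprocess: tokenizer defined elsewhere
inverted_index = defaultdict(list)

for doc_id, text in corpus.items():
    tokens = preprocess(text)          # strip punctuation, lowercase, etc.
    for term in tokens:
        # A doc_id is appended once per occurrence of the term
        inverted_index[term].append(doc_id)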
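BM25 also needs per-document term frequencies and document lengths, so a slightly richer posting structure can be convenient. The following is a minimal, self-contained sketch under stated assumptions: the preprocess tokenizer, the build_index helper, and the two-document corpus are hypothetical stand-ins, not the project's actual implementation.

```python
import re
from collections import defaultdict

def preprocess(text):
    """Hypothetical tokenizer: lowercase and keep alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(corpus):
    """Map each term to {doc_id: term frequency}; also record document lengths."""
    index = defaultdict(dict)
    doc_lengths = {}
    for doc_id, text in corpus.items():
        tokens = preprocess(text)
        doc_lengths[doc_id] = len(tokens)
        for term in tokens:
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index, doc_lengths

# Invented sample documents
corpus = {
    1: "Inverted indexes power search engines.",
    2: "Search engines rank documents; ranking uses BM25.",
}
index, doc_lengths = build_index(corpus)

postings = index.get("search", {})  # .get avoids creating entries for unknown terms
print(postings, doc_lengths)
```

With this layout, the number of documents containing a term is len(postings), the term frequency is postings[doc_id], and the average document length can be derived from doc_lengths, which covers the inputs a BM25 scorer expects.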

Trade‑off:

Pro: In‑memory index → sub‑millisecond lookups.
Con: High RAM usage; acceptable for

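fetch('response.json')
    .then(res => {
        if (!res.ok) throw new Error(`HTTP ${res.status}`);
        return res.json();
    })
    .then(data => {
        const resultsDiv = document.getElementById('results');
        resultsDiv.innerHTML = ''; // Clear old results

        data.forEach(paper => {
            // Use textContent so titles and abstracts are never parsed as HTML
            const title = document.createElement('h3');
            title.textContent = paper.title;
            const abstract = document.createElement('p');
            abstract.textContent = paper.abstract;
            resultsDiv.append(title, abstract);
        });
    })
    .catch(err => console.error('Search error:', err));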
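The frontend snippet reads a title and an abstract from every element of a JSON array, so the backend has to serialize ranked results in that shape. Here is a small sketch of that step in plain Python. The to_payload helper and the papers dictionary are assumptions made for illustration; only the two field names come from the snippet above.

```python
import json

def to_payload(ranked_ids, papers):
    """Serialize ranked results as a JSON array of {title, abstract} objects."""
    results = [
        {"title": papers[doc_id]["title"], "abstract": papers[doc_id]["abstract"]}
        for doc_id in ranked_ids
        if doc_id in papers  # skip ids with no stored metadata
    ]
    return json.dumps(results)

# Invented metadata store keyed by document id
papers = {
    1: {"title": "Paper One", "abstract": "First abstract.", "year": 2020},
    2: {"title": "Paper Two", "abstract": "Second abstract.", "year": 2021},
}
payload = to_payload([2, 1, 99], papers)
print(payload)
```

Keeping the payload to the two fields the page renders preserves the ranking order and avoids shipping unused metadata to the browser.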

Try It Yourself

The full source code is open source. Clone the repository, follow the README instructions, and run the application locally to explore the indexing and ranking pipeline.
