Inside a Scholarly Search Engine: Indexing, Ranking, and Retrieval

Published: January 11, 2026 at 02:35 AM EST
2 min read
Source: Dev.to

What Are We Actually Building?

To build a functional scholarly search engine, we moved beyond simple database queries and implemented an Inverted Index, the core data structure behind most modern search engines.

The Stack

  • Core: Python (logic and data structures)
  • Web Framework: Flask (API and UI)
  • Frontend: HTML, CSS & vanilla JavaScript (lightweight, monolithic)

The Secret Sauce

A custom‑built Inverted Index paired with a BM25 ranking algorithm.

The “Aha!” Moment: Why Simple Counts Do Not Work

Ranking purely by term frequency favors documents that are merely wordy.
BM25 addresses this by:

  1. Penalizing common words (Inverse Document Frequency).
  2. Normalizing for document length (Length Normalization).
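
To make the two corrections concrete, here is a minimal sketch with made-up numbers. The helper names `tf_component` and `idf` are illustrative rather than taken from the project, and the IDF uses the common 0.5-smoothed form.

```python
import math

def tf_component(f, dl, avdl, k1=1.5, b=0.75):
    """BM25 term-frequency part: saturates as f grows, shrinks for long documents."""
    K = k1 * ((1 - b) + b * (dl / avdl))
    return ((k1 + 1) * f) / (K + f)

def idf(n, N):
    """Inverse document frequency with 0.5 smoothing (n of N documents contain the term)."""
    return math.log((N - n + 0.5) / (n + 0.5))

# Hypothetical numbers: a 100-word paper with 3 hits vs. a 1000-word paper with 6 hits.
avdl = 550
short_doc = tf_component(f=3, dl=100, avdl=avdl)
long_doc = tf_component(f=6, dl=1000, avdl=avdl)

# Raw counts would rank the long paper first (6 > 3); the normalized score does not.
print(short_doc > long_doc)

# A term found in 10 of 1000 documents carries more weight than one found in 900.
print(idf(10, 1000) > idf(900, 1000))
```

Both comparisons print `True`: the shorter, denser paper outranks the wordy one, and the rare term outweighs the common one.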

BM25 Scoring Function (Python)

import math

def score_bm25(n, f, qf, r, N, dl, avdl, k1=1.5, b=0.75, k2=100, R=0):
    """
    n   – number of documents containing the term
    f   – term frequency in the document
    qf  – term frequency in the query
    r   – number of relevant documents containing the term
    N   – total number of documents
    dl  – document length
    avdl– average document length
    R   – number of known relevant documents (0 without relevance feedback)
    """
    # Scaling factor based on document length
    K = k1 * ((1 - b) + b * (dl / avdl))

    # Relevance component (a smoothed IDF when r = R = 0)
    first = math.log(((r + 0.5) / (R - r + 0.5)) /
                     ((n - r + 0.5) / (N - n - R + r + 0.5)))
    # Term-frequency component, saturating as f grows
    second = ((k1 + 1) * f) / (K + f)
    # Query-term-frequency component
    third = ((k2 + 1) * qf) / (k2 + qf)

    return first * second * third
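
A per-term score only becomes a ranking once it is summed over the query terms for each document. The sketch below shows one way to do that under simplifying assumptions: there is no relevance feedback, so the relevance component is the plain smoothed IDF, and each query term occurs once, so the query-frequency factor equals 1. The names `bm25_term` and `rank` and the toy documents are hypothetical, not the project's code.

```python
import math
from collections import Counter

def bm25_term(f, n, N, dl, avdl, k1=1.5, b=0.75):
    """Per-term BM25 score without relevance feedback: IDF times the TF component."""
    idf = math.log((N - n + 0.5) / (n + 0.5))
    K = k1 * ((1 - b) + b * (dl / avdl))
    return idf * ((k1 + 1) * f) / (K + f)

def rank(query_terms, docs):
    """docs maps doc_id -> list of tokens. Returns (doc_id, score) pairs, best first."""
    N = len(docs)
    avdl = sum(len(tokens) for tokens in docs.values()) / N
    counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    scores = {}
    for term in query_terms:
        # n = number of documents that contain this term
        n = sum(1 for c in counts.values() if term in c)
        if n == 0:
            continue  # term absent from the collection contributes nothing
        for doc_id, c in counts.items():
            if term in c:
                s = bm25_term(c[term], n, N, len(docs[doc_id]), avdl)
                scores[doc_id] = scores.get(doc_id, 0.0) + s
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy, already-tokenized collection (illustrative data only).
docs = {
    "p1": ["neural", "ranking", "models", "for", "search"],
    "p2": ["graph", "theory", "notes"],
    "p3": ["ranking", "ranking", "metrics", "and", "ranking", "loss"],
    "p4": ["database", "query", "planning", "basics"],
    "p5": ["citation", "graph", "analysis", "methods"],
}
results = rank(["neural", "ranking"], docs)
print([doc_id for doc_id, _ in results])
```

Here `p1` comes first because it matches both terms, including the rarer one, while `p3` repeats only the more common term. In very small collections a term found in more than half the documents gets a negative IDF under this formula, which some implementations clamp at zero.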

Indexing: The Heavy Lifting

The Inverted Index works like a textbook index: each term maps to a list of document IDs where it appears.

Simplified Indexing Process (Python)

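from collections import defaultdict

# corpus (doc_id -> raw text) and preprocess (text -> list of terms)
# are assumed to be defined elsewhere.
inverted_index = defaultdict(list)

for doc_id, text in corpus.items():
    tokens = preprocess(text)          # strip punctuation, lowercase, etc.
    for term in tokens:
        # Appends once per occurrence, so a repeated term repeats the doc ID
        inverted_index[term].append(doc_id)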
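
Since BM25 needs per-document term frequencies and document lengths, one possible variant stores a count per document instead of a flat list of IDs. This is a sketch under assumptions, not the project's implementation: the `preprocess` tokenizer, `build_index`, `lookup`, and the three-paper corpus are all hypothetical.

```python
import re
from collections import Counter, defaultdict

def preprocess(text):
    """Hypothetical tokenizer: lowercase, keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(corpus):
    """Map each term to {doc_id: term frequency}; also record document lengths."""
    index = defaultdict(dict)
    doc_lengths = {}
    for doc_id, text in corpus.items():
        tokens = preprocess(text)
        doc_lengths[doc_id] = len(tokens)
        for term, freq in Counter(tokens).items():
            index[term][doc_id] = freq
    return index, doc_lengths

def lookup(index, query):
    """Return IDs of documents that contain every query term (AND semantics)."""
    postings = [set(index.get(term, {})) for term in preprocess(query)]
    return set.intersection(*postings) if postings else set()

# Illustrative corpus only.
corpus = {
    1: "Inverted index construction for scholarly search",
    2: "Ranking scholarly papers with BM25",
    3: "Search, search, and more search: a survey",
}
index, doc_lengths = build_index(corpus)
print(index["search"])                    # {1: 1, 3: 3}
print(lookup(index, "scholarly search"))  # {1}
```

Keeping the frequency next to each document ID means the scorer can read `f`, `n` (the size of the posting dict), and `dl` directly from the index without rescanning the text.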

Trade‑off:

  • Pro: In‑memory index → sub‑millisecond lookups.
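  • Con: High RAM usage; acceptable for

fetch('response.json')
    .then(res => {
        if (!res.ok) throw new Error(`HTTP ${res.status}`);
        return res.json();
    })
    .then(data => {
        const resultsDiv = document.getElementById('results');
        resultsDiv.innerHTML = ''; // Clear old results

        data.forEach(paper => {
            // Build nodes with textContent so titles and abstracts are
            // inserted as plain text rather than parsed as HTML
            const title = document.createElement('h3');
            title.textContent = paper.title;
            const abstract = document.createElement('p');
            abstract.textContent = paper.abstract;
            resultsDiv.append(title, abstract);
        });
    })
    .catch(err => console.error('Search error:', err));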

Try It Yourself

The full source code is open‑sourced. Clone the repository, follow the README instructions, and run the application locally to explore the indexing and ranking pipeline.
