深入学术搜索引擎：索引、排序与检索

发布: 0个月前 (2026年1月11日 GMT+8 15:35)

3 分钟阅读

原文: Dev.to

Source: Dev.to

我们到底在构建什么？

为了构建一个功能完整的学术搜索引擎，我们超越了简单的数据库查询，实现了 倒排索引——大多数现代搜索引擎背后的核心数据结构。

技术栈

核心： Python（逻辑和数据结构）
Web 框架： Flask（API 与 UI）
前端： HTML、CSS 与原生 JavaScript（轻量、单体）

秘密武器

自定义的倒排索引搭配 BM25 排序算法。

“啊哈！”时刻：为什么单纯计数行不通

仅凭词频进行排序会偏向那些文字冗长的文档。
BM25 通过以下方式解决这个问题：

惩罚常见词（逆文档频率）。
对文档长度进行归一化（长度归一化）。

BM25 评分函数（Python）

import math

def score_bm25(n, f, qf, r, N, dl, avdl, k1=1.5, b=0.75, k2=100):
    """
    n   – number of documents containing the term
    f   – term frequency in the document
    qf  – term frequency in the query
    r   – number of relevant documents containing the term
    N   – total number of documents
    dl  – document length
    avdl– average document length
    """
    # Scaling factor based on document length
    K = k1 * ((1 - b) + b * (dl / avdl))

    # Relevance component
    first = math.log(((r + 0.5) / (N - r + 0.5)) /
                     ((n - r + 0.5) / (N - n - (N - r) + 0.5)))
    second = ((k1 + 1) * f) / (K + f)
    third = ((k2 + 1) * qf) / (k2 + qf)

    return first * second * third

索引构建：繁重的工作

倒排索引的工作方式类似教材的索引：每个词对应一个出现该词的文档 ID 列表。

简化的索引过程（Python）

from collections import defaultdict

inverted_index = defaultdict(list)

for doc_id, text in corpus.items():
    tokens = preprocess(text)          # strip punctuation, lowercase, etc.
    for term in tokens:
        inverted_index[term].append(doc_id)

权衡：

优点： 内存索引 → 子毫秒级查询。
缺点： 高内存占用；在以下场景下仍可接受

fetch('response.json')
    .then(res => res.json())
    .then(data => {
        const resultsDiv = document.getElementById('results');
        resultsDiv.innerHTML = ''; // Clear old results

        data.forEach(paper => {
            const item = `
                <h3>${paper.title}</h3>
                <p>${paper.abstract}</p>
            `;
            resultsDiv.innerHTML += item;
        });
    })
    .catch(err => console.error('Search error:', err));

亲自尝试

完整源码已开源。克隆仓库，按照 README 中的说明操作，在本地运行应用，即可探索索引与排序的完整流程。

深入学术搜索引擎：索引、排序与检索

我们到底在构建什么？

技术栈

秘密武器

“啊哈！”时刻：为什么单纯计数行不通

BM25 评分函数（Python）

索引构建：繁重的工作

简化的索引过程（Python）

亲自尝试

相关文章

使用 Docker、Flask 和 Socket.IO 构建专业实时聊天应用

第6天 – 使用 Python 构建固定运输路线和票价参考系统

你已经掌握 Python 基础——现在让我们构建真实项目

🎉 Python 开发者和 Mermaid 爱好者的重大消息：'mmdc' 让 Mermaid 图表像 Python 一样简单！ 🚀