Self-hosted plagiarism detection with OpenSearch
Published: January 31, 2026, 13:26 (GMT+8)
Original: Dev.to
Two-stage approach
Finding candidates with more_like_this
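# Only consider documents that contain an answer to the same question.
search = cls.search().filter(
    "nested",
    path="answers",
    query={"term": {"answers.question_id": str(question_id)}},
)
# Skip the submitting user's own answers.
search = search.exclude("term", user_id=user_id)
# Rank the remaining documents by term overlap with the submitted text.
search = search.query(
    "nested",
    path="answers",
    query={
        "more_like_this": {
            "fields": ["answers.answer"],
            "like": text,
            "min_term_freq": 1,
            "minimum_should_match": "1%",
        }
    },
)
response = search.execute()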
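For readers who want to see the shape of the request, the chained calls above correspond roughly to a single bool query. The following is a minimal sketch of such a request body as a plain Python dict. The helper name `build_candidate_query` is hypothetical, and the exact JSON the DSL client emits may be arranged differently, so treat this as an approximation rather than the author's code.

```python
import json


def build_candidate_query(question_id, user_id, text):
    """Hypothetical helper: a request body roughly equivalent to the chained calls."""
    return {
        "query": {
            "bool": {
                # Restrict to documents with an answer to the same question.
                "filter": [
                    {
                        "nested": {
                            "path": "answers",
                            "query": {
                                "term": {"answers.question_id": str(question_id)}
                            },
                        }
                    }
                ],
                # Exclude the submitting user's own document.
                "must_not": [{"term": {"user_id": user_id}}],
                # Score by term overlap with the submitted text.
                "must": [
                    {
                        "nested": {
                            "path": "answers",
                            "query": {
                                "more_like_this": {
                                    "fields": ["answers.answer"],
                                    "like": text,
                                    "min_term_freq": 1,
                                    "minimum_should_match": "1%",
                                }
                            },
                        }
                    }
                ],
            }
        }
    }


if __name__ == "__main__":
    body = build_candidate_query(42, "user-1", "example answer text")
    print(json.dumps(body, indent=2))
```

The very low `minimum_should_match` keeps recall high at this stage; precision is left to the reranking step.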
Reranking with character n-grams
import re

def normalize(t):
    # Drop all whitespace so spacing changes do not affect the score.
    return re.sub(r"\s+", "", t.strip())

def char_ngrams(t, n=3):
    # Set of all overlapping character n-grams in t.
    return set(t[i:i+n] for i in range(len(t) - n + 1))

norm_text = normalize(text)
text_ngrams = char_ngrams(norm_text)

for hit in response.hits:
    norm_answer = normalize(hit.answer)
    answer_ngrams = char_ngrams(norm_answer)
    intersection = len(text_ngrams & answer_ngrams)
    union = len(text_ngrams | answer_ngrams)
    # Jaccard similarity as a whole percentage; 0 when both sets are empty.
    ratio = int((intersection / union) * 100) if union else 0
    if ratio >= 60:
        pass  # flag as similar
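To make the scoring concrete, the reranking step can be tried in isolation, without a search cluster. This is a minimal sketch, assuming the score is the Jaccard index of the two trigram sets expressed as a whole percentage. The function name `jaccard_percent` and the sample strings are illustrative and do not come from the original post.

```python
import re


def normalize(t):
    # Remove all whitespace so formatting differences are ignored.
    return re.sub(r"\s+", "", t.strip())


def char_ngrams(t, n=3):
    # Set of overlapping character n-grams; empty if t is shorter than n.
    return {t[i:i + n] for i in range(len(t) - n + 1)}


def jaccard_percent(a, b, n=3):
    """Illustrative helper: Jaccard similarity of character n-gram sets, 0 to 100."""
    a_ngrams = char_ngrams(normalize(a), n)
    b_ngrams = char_ngrams(normalize(b), n)
    union = a_ngrams | b_ngrams
    if not union:
        # Both inputs are shorter than n characters; nothing to compare.
        return 0
    return int(len(a_ngrams & b_ngrams) / len(union) * 100)


print(jaccard_percent("hello world", "hello   world"))  # 100
print(jaccard_percent("abcd", "abce"))  # 33
```

Whitespace-only edits leave the score at 100, while a single changed character in a very short string drops it sharply, so short answers are more sensitive to the threshold than long ones.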
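This works reasonably well. The similarity threshold of about 60% was found by trial and error.
The setup is self-hosted, simple to operate, and reuses the existing search infrastructure.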