Self-hosted Plagiarism Detection with OpenSearch

Published: 3 months ago (January 31, 2026 at 12:26 AM EST)

1 min read

Source: Dev.to

Source: Dev.to

Two‑stage approach

Find candidates with `more_like_this`

search = cls.search().filter(
    "nested", path="answers",
    query={"term": {"answers.question_id": str(question_id)}}
)
search = search.exclude("term", user_id=user_id)
search = search.query(
    "nested",
    path="answers",
    query={
        "more_like_this": {
            "fields": ["answers.answer"],
            "like": text,
            "min_term_freq": 1,
            "minimum_should_match": "1%",
        }
    },
)
response = search.execute()

Re‑rank with character n‑grams

def normalize(t):
    return re.sub(r"\s+", "", t.strip())

def char_ngrams(t, n=3):
    return set(t[i:i+n] for i in range(len(t)-n+1))

norm_text = normalize(text)
text_ngrams = char_ngrams(norm_text)

for hit in response.hits:
    norm_answer = normalize(hit.answer)
    answer_ngrams = char_ngrams(norm_answer)

    intersection = len(text_ngrams & answer_ngrams)
    union = len(text_ngrams | answer_ngrams)
    ratio = int((intersection / union) * 100)

    if ratio >= 60:
        # flag as similar

Works decently—about 60 % similarity threshold was found through trial and error.
Self‑hosted, simple operations, and it reuses the existing search infrastructure.

The full implementation

https://github.com/cobel1024/minima

Self-hosted Plagiarism Detection with OpenSearch

Two‑stage approach

Find candidates with `more_like_this`

Re‑rank with character n‑grams

The full implementation

Related posts

Python 3.13 and 3.14 are now available

Problem 12: Find Pairs with Target Sum

The Secret Life of Python: The Hidden Return

📨 Build an Email Validation Tool with Python & Tkinter (Step-by-Step)

Two‑stage approach

Find candidates with more_like_this

Re‑rank with character n‑grams

The full implementation

Related posts

Python 3.13 and 3.14 are now available

Problem 12: Find Pairs with Target Sum

The Secret Life of Python: The Hidden Return

📨 Build an Email Validation Tool with Python & Tkinter (Step-by-Step)

Find candidates with `more_like_this`