[Paper] Can Code Evaluation Metrics Detect Code Plagiarism?

Published: April 28, 2026 at 11:45 AM EDT
4 min read

Source: arXiv - 2604.25778v1

Overview

The paper Can Code Evaluation Metrics Detect Code Plagiarism? investigates whether metrics originally built to grade automatically‑generated code (e.g., CodeBLEU, CodeBERTScore) can also serve as reliable plagiarism detectors. By benchmarking five such metrics against two well‑known plagiarism tools (JPlag and Dolos) across multiple levels of code modification, the authors show that certain metrics can rival—or even surpass—traditional detectors, especially when preprocessing is applied.

Key Contributions

  • Empirical comparison of five Code Evaluation Metrics (CEMs) with two state‑of‑the‑art plagiarism detection tools on two public plagiarism datasets.
  • Level‑wise analysis (L1–L6) that quantifies how detection performance degrades as code is increasingly obfuscated.
  • Demonstration that preprocessing (e.g., removing boiler‑plate/template code) dramatically improves metric‑based detection, making CrystalBLEU the top performer overall.
  • Ranking‑based evaluation framework that avoids arbitrary similarity thresholds, offering a more nuanced view of detection quality.
  • Open‑source reproducibility: all experiments are built on publicly available datasets (ConPlag raw/template‑free, IRPlag) and code.

Methodology

  1. Datasets – Two labeled plagiarism corpora were used:

    • ConPlag (both raw and a “template‑free” version where common scaffolding was stripped).
    • IRPlag, another benchmark with known plagiarism pairs.
      Each pair is annotated with a modification level (L1 = minimal changes, up to L6 = heavy rewrites).
  2. Metrics Evaluated

    • CodeBLEU – BLEU‑style n‑gram overlap plus syntax and data‑flow components.
    • CrystalBLEU – A BLEU variant that ignores “trivially shared” n‑grams (the most frequent n‑grams in a background corpus), reducing the influence of boiler‑plate patterns.
    • RUBY – A multi‑representation similarity score that compares code at the string, syntax‑tree, and data‑flow (graph) levels.
    • Tree‑Structured Edit Distance (TSED) – Direct edit‑distance on ASTs.
    • CodeBERTScore – Embedding‑based similarity using a pretrained CodeBERT model.
  3. Baseline Tools – JPlag (token‑based) and Dolos (graph‑based).

  4. Evaluation Protocol

    • No fixed similarity threshold; instead, each method produces a similarity score for every candidate pair.
    • Scores are sorted, and ranking‑based metrics (e.g., Mean Reciprocal Rank, Normalized Discounted Cumulative Gain) measure how high true plagiarism pairs appear in the list; a minimal sketch of this ranking step appears after this list.
    • Analyses are performed overall, per dataset, and per modification level.
  5. Preprocessing – For the “template‑free” experiments, common boiler‑plate code (e.g., import statements, main function skeletons) is stripped before scoring.
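The ranking step can be made concrete with a short sketch. The snippet below is illustrative rather than the authors' code: the pair format, the toy data, and the choice to score a query with no labelled pair as zero are all assumptions; only the idea of sorting candidates by similarity and measuring where true plagiarism pairs land (here via Mean Reciprocal Rank) reflects the paper's protocol.

```python
from collections import defaultdict

def mean_reciprocal_rank(scored_pairs, true_pairs):
    """Threshold-free ranking evaluation (illustrative sketch, not the paper's code).

    scored_pairs: iterable of (query_id, candidate_id, similarity_score) triples
    true_pairs:   set of (query_id, candidate_id) tuples labelled as plagiarism
    """
    by_query = defaultdict(list)
    for query, candidate, score in scored_pairs:
        by_query[query].append((candidate, score))

    reciprocal_ranks = []
    for query, candidates in by_query.items():
        # Sort each query's candidates by descending similarity score.
        ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
        for rank, (candidate, _) in enumerate(ranked, start=1):
            if (query, candidate) in true_pairs:
                reciprocal_ranks.append(1.0 / rank)
                break
        else:
            reciprocal_ranks.append(0.0)  # no labelled pair retrieved for this query
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0

# Hypothetical usage with made-up similarity scores:
scores = [("s1", "s2", 0.91), ("s1", "s3", 0.40), ("s2", "s3", 0.35)]
labels = {("s1", "s2")}
print(mean_reciprocal_rank(scores, labels))  # 0.5: s1's true pair ranks first, s2 has no labelled pair
```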

Results & Findings

| Scenario | Best Overall Performer | Notable Observations |
| --- | --- | --- |
| Raw datasets, no preprocessing | Dolos (overall ranking) | Traditional graph‑based tool still leads when code contains a lot of scaffolding. |
| Metric‑only (no preprocessing) | CrystalBLEU, CodeBLEU, RUBY outperform JPlag | Metric‑based approaches can beat a classic token‑based detector. |
| With preprocessing | CrystalBLEU surpasses Dolos | Removing boiler‑plate lets CrystalBLEU shine. |
| Per‑level performance | Strong at L1–L3, drops after L4 | All methods struggle with heavily obfuscated code, but CrystalBLEU remains competitive even at L6. |
| Dataset‑specific | Dolos best on ConPlag‑raw; CrystalBLEU best on ConPlag‑template‑free & IRPlag | The choice of tool can depend on dataset characteristics. |

In short, code‑evaluation metrics are comparable to dedicated plagiarism detectors, especially when the input is cleaned of irrelevant template code.
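To give some intuition for why CrystalBLEU copes well once boiler‑plate is gone, here is a minimal, hand‑rolled sketch of its core idea: n‑gram overlap that ignores the “trivially shared” n‑grams most frequent in a background corpus. The whitespace tokenizer, bigram size, and cut‑off k below are illustrative assumptions, not the metric's actual configuration or the paper's setup.

```python
from collections import Counter

def ngrams(tokens, n=2):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def trivially_shared(corpus_token_lists, n=2, k=5):
    """Collect the k most frequent n-grams across a background corpus;
    these are ignored during scoring, in the spirit of CrystalBLEU."""
    counts = Counter()
    for tokens in corpus_token_lists:
        counts.update(ngrams(tokens, n))
    return {gram for gram, _ in counts.most_common(k)}

def filtered_overlap(candidate, reference, ignore, n=2):
    """Clipped n-gram precision after dropping the trivially shared n-grams."""
    cand = Counter(g for g in ngrams(candidate, n) if g not in ignore)
    ref = Counter(g for g in ngrams(reference, n) if g not in ignore)
    if not cand:
        return 0.0
    matched = sum(min(count, ref[g]) for g, count in cand.items())
    return matched / sum(cand.values())

# Hypothetical usage with naive whitespace tokenization (an assumption):
background = [src.split() for src in ["public static void main", "import java util"]]
ignore = trivially_shared(background)
print(filtered_overlap("int x = a + b".split(), "int y = a + b".split(), ignore))  # 0.6
```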

Practical Implications

  • Automated grading pipelines can reuse the same similarity metric for both quality assessment and plagiarism detection, reducing engineering overhead.
  • IDE plugins or CI/CD checks could embed a lightweight metric like CrystalBLEU to flag suspicious submissions early, before invoking heavier graph‑based tools.
  • Educational platforms (e.g., MOOCs, university autograders) can benefit from the “threshold‑free” ranking approach, which provides a graded suspicion score rather than a binary yes/no verdict.
  • Open‑source code review tools might adopt these metrics to surface potential copy‑paste incidents across large repositories, complementing existing static analysis.
  • Since preprocessing dramatically boosts performance, pre‑flight sanitization (removing imports, test harnesses, or common starter code) should become a standard step in any plagiarism‑detection workflow; a minimal sketch of one such step follows this list.
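One possible shape for that sanitization step, sketched under assumptions (the helper name, the import/package rules, and the line‑level template matching are illustrative, not the paper's preprocessing):

```python
def strip_template(source: str, template: str) -> str:
    """Drop import/package lines and lines copied verbatim from a starter template.
    Mirrors the idea behind the 'template-free' setting, not its exact rules."""
    template_lines = {line.strip() for line in template.splitlines() if line.strip()}
    kept = []
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # drop blank lines
        if stripped.startswith(("import ", "package ")):
            continue  # drop boiler-plate imports / package declarations
        if stripped in template_lines:
            continue  # drop lines that appear unchanged in the assignment template
        kept.append(line)
    return "\n".join(kept)

# Hypothetical usage on a toy Java submission:
template = (
    "import java.util.*;\n"
    "public class Main {\n"
    "    public static void main(String[] args) {\n"
    "    }\n"
    "}"
)
submission = (
    "import java.util.*;\n"
    "public class Main {\n"
    "    public static void main(String[] args) {\n"
    "        int answer = 42;\n"
    "    }\n"
    "}"
)
print(strip_template(submission, template))  # only the student's own line survives
```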

Limitations & Future Work

  • Scalability: Metrics like CodeBERTScore involve large language models, which can be computationally expensive for massive codebases.
  • Obfuscation resistance: Performance still degrades sharply after level L4; more robust structural or semantic analyses are needed for heavily rewritten code.
  • Language coverage: The study focuses on Java‑like languages; extending to dynamically typed or multi‑paradigm languages may require metric adaptations.
  • Human‑in‑the‑loop validation: Ranking metrics indicate “suspicion” but do not replace expert judgment; integrating explainability (e.g., highlighting changed AST nodes) is an open direction.

Overall, the work opens the door for a unified evaluation framework that can serve both code generation assessment and plagiarism detection, encouraging further research on metric‑driven, language‑agnostic similarity measures.

Authors

  • Fahad Ebrahim
  • Mike Joy

Paper Information

  • arXiv ID: 2604.25778v1
  • Categories: cs.SE, cs.AI, cs.IR
  • Published: April 28, 2026
  • PDF: Download PDF