[Paper] Can Code Evaluation Metrics Detect Code Plagiarism?

Published: April 28, 2026 at 11:45 AM EDT
4 min read

Source: arXiv - 2604.25778v1

Overview

The paper Can Code Evaluation Metrics Detect Code Plagiarism? investigates whether metrics originally built to grade automatically‑generated code (e.g., CodeBLEU, CodeBERTScore) can also serve as reliable plagiarism detectors. By benchmarking five such metrics against two well‑known plagiarism tools (JPlag and Dolos) across multiple levels of code modification, the authors show that certain metrics can rival—or even surpass—traditional detectors, especially when preprocessing is applied.

Key Contributions

  • Empirical comparison of five Code Evaluation Metrics (CEMs) with two state‑of‑the‑art plagiarism detection tools on two public plagiarism datasets.
  • Level‑wise analysis (L1–L6) that quantifies how detection performance degrades as code is increasingly obfuscated.
  • Demonstration that preprocessing (e.g., removing boiler‑plate/template code) dramatically improves metric‑based detection, making CrystalBLEU the top performer overall.
  • Ranking‑based evaluation framework that avoids arbitrary similarity thresholds, offering a more nuanced view of detection quality.
  • Open‑source reproducibility: all experiments are built on publicly available datasets (ConPlag raw/template‑free, IRPlag) and code.

Methodology

  1. Datasets – Two labeled plagiarism corpora were used:

    • ConPlag (both raw and a “template‑free” version where common scaffolding was stripped).
    • IRPlag, another benchmark with known plagiarism pairs.
      Each pair is annotated with a modification level (L1 = minimal changes, up to L6 = heavy rewrites).
  2. Metrics Evaluated

    • CodeBLEU – BLEU‑style n‑gram overlap plus syntax and data‑flow components.
    • CrystalBLEU – A BLEU variant that ignores “trivially shared” n‑grams (the most frequent n‑grams in a background corpus), reducing the influence of boiler‑plate patterns.
    • RUBY – A multi‑representation similarity score that compares code at the string, syntax‑tree, and data‑flow (graph) levels.
    • Tree‑Structured Edit Distance (TSED) – Direct edit‑distance on ASTs.
    • CodeBERTScore – Embedding‑based similarity using a pretrained CodeBERT model.
  3. Baseline Tools – JPlag (token‑based) and Dolos (graph‑based).

  4. Evaluation Protocol

    • No fixed similarity threshold; instead, each method produces a similarity score for every candidate pair.
    • Scores are sorted, and ranking‑based metrics (e.g., Mean Reciprocal Rank, Normalized Discounted Cumulative Gain) measure how high true plagiarism pairs appear in the list; a minimal sketch of this ranking step appears after this list.
    • Analyses are performed overall, per dataset, and per modification level.
  5. Preprocessing – For the “template‑free” experiments, common boiler‑plate code (e.g., import statements, main function skeletons) is stripped before scoring.
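The ranking step can be made concrete with a short sketch. The snippet below is illustrative rather than the authors' code: the pair format, the toy data, and the choice to score a query with no labelled pair as zero are all assumptions; only the idea of sorting candidates by similarity and measuring where true plagiarism pairs land (here via Mean Reciprocal Rank) reflects the paper's protocol.

```python
from collections import defaultdict

def mean_reciprocal_rank(scored_pairs, true_pairs):
    """Threshold-free ranking evaluation (illustrative sketch, not the paper's code).

    scored_pairs: iterable of (query_id, candidate_id, similarity_score) triples
    true_pairs:   set of (query_id, candidate_id) tuples labelled as plagiarism
    """
    by_query = defaultdict(list)
    for query, candidate, score in scored_pairs:
        by_query[query].append((candidate, score))

    reciprocal_ranks = []
    for query, candidates in by_query.items():
        # Sort each query's candidates by descending similarity score.
        ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
        for rank, (candidate, _) in enumerate(ranked, start=1):
            if (query, candidate) in true_pairs:
                reciprocal_ranks.append(1.0 / rank)
                break
        else:
            reciprocal_ranks.append(0.0)  # no labelled pair retrieved for this query
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0

# Hypothetical usage with made-up similarity scores:
scores = [("s1", "s2", 0.91), ("s1", "s3", 0.40), ("s2", "s3", 0.35)]
labels = {("s1", "s2")}
print(mean_reciprocal_rank(scores, labels))  # 0.5: s1's true pair ranks first, s2 has no labelled pair
```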

Results & Findings

| Scenario | Best Overall Performer | Notable Observations |
| --- | --- | --- |
| Raw datasets, no preprocessing | Dolos (overall ranking) | Traditional graph‑based tool still leads when code contains a lot of scaffolding. |
| Metric‑only (no preprocessing) | CrystalBLEU, CodeBLEU, RUBY outperform JPlag | Metric‑based approaches can beat a classic token‑based detector. |
| With preprocessing | CrystalBLEU surpasses Dolos | Removing boiler‑plate lets CrystalBLEU shine. |
| Per‑level performance | Strong at L1–L3, drops after L4 | All methods struggle with heavily obfuscated code, but CrystalBLEU remains competitive even at L6. |
| Dataset‑specific | Dolos best on ConPlag‑raw; CrystalBLEU best on ConPlag‑template‑free & IRPlag | The choice of tool can depend on dataset characteristics. |

In short, code‑evaluation metrics are comparable to dedicated plagiarism detectors, especially when the input is cleaned of irrelevant template code.
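To give some intuition for why CrystalBLEU copes well once boiler‑plate is gone, here is a minimal, hand‑rolled sketch of its core idea: n‑gram overlap that ignores the “trivially shared” n‑grams most frequent in a background corpus. The whitespace tokenizer, bigram size, and cut‑off k below are illustrative assumptions, not the metric's actual configuration or the paper's setup.

```python
from collections import Counter

def ngrams(tokens, n=2):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def trivially_shared(corpus_token_lists, n=2, k=5):
    """Collect the k most frequent n-grams across a background corpus;
    these are ignored during scoring, in the spirit of CrystalBLEU."""
    counts = Counter()
    for tokens in corpus_token_lists:
        counts.update(ngrams(tokens, n))
    return {gram for gram, _ in counts.most_common(k)}

def filtered_overlap(candidate, reference, ignore, n=2):
    """Clipped n-gram precision after dropping the trivially shared n-grams."""
    cand = Counter(g for g in ngrams(candidate, n) if g not in ignore)
    ref = Counter(g for g in ngrams(reference, n) if g not in ignore)
    if not cand:
        return 0.0
    matched = sum(min(count, ref[g]) for g, count in cand.items())
    return matched / sum(cand.values())

# Hypothetical usage with naive whitespace tokenization (an assumption):
background = [src.split() for src in ["public static void main", "import java util"]]
ignore = trivially_shared(background)
print(filtered_overlap("int x = a + b".split(), "int y = a + b".split(), ignore))  # 0.6
```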

Practical Implications

  • Automated grading pipelines can reuse the same similarity metric for both quality assessment and plagiarism detection, reducing engineering overhead.
  • IDE plugins or CI/CD checks could embed a lightweight metric like CrystalBLEU to flag suspicious submissions early, before invoking heavier graph‑based tools.
  • Educational platforms (e.g., MOOCs, university autograders) can benefit from the “threshold‑free” ranking approach, which provides a graded suspicion score rather than a binary yes/no verdict.
  • Open‑source code review tools might adopt these metrics to surface potential copy‑paste incidents across large repositories, complementing existing static analysis.
  • Since preprocessing dramatically boosts performance, pre‑flight sanitization (removing imports, test harnesses, or common starter code) should become a standard step in any plagiarism‑detection workflow; a minimal sketch of one such step follows this list.
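One possible shape for that sanitization step, sketched under assumptions (the helper name, the import/package rules, and the line‑level template matching are illustrative, not the paper's preprocessing):

```python
def strip_template(source: str, template: str) -> str:
    """Drop import/package lines and lines copied verbatim from a starter template.
    Mirrors the idea behind the 'template-free' setting, not its exact rules."""
    template_lines = {line.strip() for line in template.splitlines() if line.strip()}
    kept = []
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # drop blank lines
        if stripped.startswith(("import ", "package ")):
            continue  # drop boiler-plate imports / package declarations
        if stripped in template_lines:
            continue  # drop lines that appear unchanged in the assignment template
        kept.append(line)
    return "\n".join(kept)

# Hypothetical usage on a toy Java submission:
template = (
    "import java.util.*;\n"
    "public class Main {\n"
    "    public static void main(String[] args) {\n"
    "    }\n"
    "}"
)
submission = (
    "import java.util.*;\n"
    "public class Main {\n"
    "    public static void main(String[] args) {\n"
    "        int answer = 42;\n"
    "    }\n"
    "}"
)
print(strip_template(submission, template))  # only the student's own line survives
```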

Limitations & Future Work

  • Scalability: Metrics like CodeBERTScore involve large language models, which can be computationally expensive for massive codebases.
  • Obfuscation resistance: Performance still degrades sharply after level L4; more robust structural or semantic analyses are needed for heavily rewritten code.
  • Language coverage: The study focuses on Java‑like languages; extending to dynamically typed or multi‑paradigm languages may require metric adaptations.
  • Human‑in‑the‑loop validation: Ranking metrics indicate “suspicion” but do not replace expert judgment; integrating explainability (e.g., highlighting changed AST nodes) is an open direction.

Overall, the work opens the door for a unified evaluation framework that can serve both code generation assessment and plagiarism detection, encouraging further research on metric‑driven, language‑agnostic similarity measures.

Authors

  • Fahad Ebrahim
  • Mike Joy

Paper Information

  • arXiv ID: 2604.25778v1
  • Categories: cs.SE, cs.AI, cs.IR
  • Published: April 28, 2026
  • PDF: Download PDF