[Paper] On Plagiarism and Software Plagiarism

Published: January 1, 2026 at 01:40 PM EST
4 min read
Source: arXiv - 2601.00429v1

Overview

The paper by Rares Folea and Emil Slusanschi dives into the thorny problem of automatically detecting software plagiarism. By dissecting both the technical hurdles and the legal backdrop, the authors introduce Project Martial, an open‑source toolkit that brings state‑of‑the‑art similarity detection to developers, educators, and companies alike.

Key Contributions

  • Comprehensive taxonomy of software plagiarism detection challenges, organized by the type of artifacts (source code, binaries, execution traces, etc.).
  • Survey of existing techniques—fingerprinting, “software birthmarks,” and modern code‑embedding models—highlighting strengths and gaps.
  • Design and implementation of Project Martial, an extensible, open‑source platform that integrates multiple detection algorithms under a unified API.
  • Legal and academic context review, summarizing landmark lawsuits and court rulings that shape how copyright law applies to code.
  • Practical guidelines for choosing the right detection strategy based on project size, language diversity, and performance constraints.

Methodology

  1. Literature Mapping – The authors collected and classified prior work on code similarity, ranging from classic token‑based fingerprinting (e.g., winnowing) to recent neural embeddings (e.g., CodeBERT).
  2. Challenge Categorization – They broke down detection problems into four artifact‑based classes:
    • Source‑level (raw text, ASTs)
    • Compiled‑level (bytecode, binaries)
    • Runtime‑level (execution traces, dynamic behavior)
    • Hybrid (combining static and dynamic signals)
  3. Tool Design – Project Martial was built as a modular pipeline:
    • Pre‑processing adapters for different languages and artifact types.
    • Feature extractors implementing fingerprinting, birthmark extraction, and embedding generation.
    • Similarity engines (Jaccard, cosine, graph‑matching) that can be swapped or stacked.
    • Reporting layer that outputs human‑readable diff visualizations and machine‑friendly similarity scores.
  4. Evaluation – The authors benchmarked the toolkit on public plagiarism datasets (e.g., the Google Code Jam “Copy‑Paste” corpus) and on a curated set of real‑world open‑source projects, measuring detection accuracy, false‑positive rates, and runtime performance.
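The token-based fingerprinting the methodology mentions can be illustrated with a simplified winnowing sketch: hash every k-gram of the normalized text, keep the minimum hash of each sliding window, and compare the resulting fingerprint sets with Jaccard similarity. This is a minimal illustration of the classic technique, not Project Martial's implementation; the function names and parameter defaults are assumptions.

```python
import hashlib

def kgram_hashes(text: str, k: int = 5) -> list:
    """Hash every k-character gram of the whitespace-stripped, lowercased text."""
    text = "".join(text.split()).lower()  # crude normalization
    return [int(hashlib.md5(text[i:i + k].encode()).hexdigest(), 16) & 0xFFFFFFFF
            for i in range(len(text) - k + 1)]

def winnow(hashes: list, w: int = 4) -> set:
    """Keep the minimum hash of each window of w consecutive k-gram hashes.
    (Classic winnowing also records positions and breaks ties to the right;
    this set-based version is simplified.)"""
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two fingerprint sets."""
    return len(a & b) / len(a | b) if a | b else 0.0
```

Two near-identical snippets that differ only in identifier names still share many k-grams, so their winnowed fingerprint sets overlap, while unrelated code yields a similarity near zero.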

Results & Findings

  • Accuracy: Embedding‑based detectors (CodeBERT‑derived) achieved the highest recall (≈ 92 %) for heavily obfuscated copies, while classic fingerprinting excelled at low‑obfuscation cases with near‑zero false positives.
  • Speed: Fingerprinting pipelines processed ~10 k lines of code per second on commodity hardware, whereas embedding models required GPU acceleration to stay under 1 s per file.
  • Hybrid Approach Wins: Combining a fast fingerprint filter with a slower embedding verifier reduced overall runtime by ~70 % while preserving high detection quality.
  • Legal Insight: The analysis of court cases (e.g., Oracle v. Google, SAS Institute v. World Programming) shows that similarity thresholds used by courts are often far lower than what purely technical tools flag, underscoring the need for contextual interpretation.
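The two-stage design behind the hybrid result can be sketched as a cheap Jaccard filter that prunes the corpus before an expensive verifier (e.g. an embedding model) runs on the survivors. This is an illustration of the reported filter-then-verify pattern under assumed thresholds, not the paper's actual pipeline; `hybrid_check` and its parameters are hypothetical names.

```python
from typing import Callable, Dict, List, Set

def hybrid_check(query_fp: Set[int], corpus: Dict[str, Set[int]],
                 verify: Callable[[str], float],
                 filter_threshold: float = 0.2,
                 verify_threshold: float = 0.8) -> List[str]:
    """Stage 1: cheap fingerprint Jaccard filter over the whole corpus.
    Stage 2: expensive verifier (e.g. embedding cosine similarity),
    invoked only for candidates that survive the filter."""
    def jaccard(a: Set[int], b: Set[int]) -> float:
        return len(a & b) / len(a | b) if a | b else 0.0

    candidates = [name for name, fp in corpus.items()
                  if jaccard(query_fp, fp) >= filter_threshold]
    # Only filtered candidates pay the verifier's cost.
    return [name for name in candidates if verify(name) >= verify_threshold]
```

Because the verifier touches only a small fraction of files, most of the embedding cost is avoided, which matches the roughly 70 % runtime reduction the authors report.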

Practical Implications

  • Educational Platforms: Instructors can integrate Project Martial into learning management systems to automatically flag suspicious submissions, giving students early feedback before formal investigations.
  • Open‑Source Governance: Maintainers can run periodic scans across their repositories to catch inadvertent code reuse that might violate upstream licenses.
  • Enterprise Code Audits: Companies can embed the toolkit in CI/CD pipelines to enforce internal IP policies, catching copy‑and‑paste from proprietary libraries before they ship.
  • Legal Defense/Prosecution: The detailed similarity scores and visual diff reports provide concrete technical evidence that can be presented in copyright disputes.
  • Extensibility: Because the platform is open source and language‑agnostic, developers can plug in custom extractors (e.g., for domain‑specific DSLs) or swap in newer embedding models as they become available.
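A CI/CD gate over machine-friendly similarity scores could look like the sketch below: read a report, fail the build if any pair exceeds a policy threshold. The JSON report shape and the `gate` function are assumptions for illustration, not Project Martial's actual output format.

```python
import json
import sys

def gate(report_path: str, threshold: float = 0.8) -> int:
    """Return a CI exit code: 1 if any file pair meets or exceeds the
    similarity threshold, 0 otherwise. Assumed report format: a JSON
    list of {"pair": [...], "score": float} records."""
    with open(report_path) as f:
        report = json.load(f)
    flagged = [r for r in report if r["score"] >= threshold]
    for r in flagged:
        print(f"similarity {r['score']:.2f}: {r['pair']}", file=sys.stderr)
    return 1 if flagged else 0
```

Wiring the exit code into a pipeline step makes the IP policy enforceable before code ships.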

Limitations & Future Work

  • Dataset Bias: The evaluation relied on publicly available plagiarism corpora, which may not reflect the full spectrum of real‑world obfuscation tactics used in commercial settings.
  • Language Coverage: While the core supports major languages (Java, Python, C/C++), niche or emerging languages lack dedicated parsers and may need community contributions.
  • Legal Nuance: The tool provides similarity metrics but does not interpret legal thresholds; integrating policy engines that map scores to jurisdiction‑specific standards remains an open challenge.
  • Scalability of Embeddings: Large‑scale codebases (millions of files) still strain GPU resources; future work aims at distilling lightweight embedding models or leveraging approximate nearest‑neighbor indexing.
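One standard way to pursue the approximate nearest-neighbor direction is MinHash: compress each fingerprint set into a fixed-size signature whose agreement rate estimates Jaccard similarity, so huge corpora can be bucketed with LSH instead of compared pairwise. This is a generic illustration of that direction, not a technique the paper implements.

```python
import random
from typing import List, Set

def minhash_signature(fingerprints: Set[int],
                      num_hashes: int = 64, seed: int = 0) -> List[int]:
    """Compress a non-empty fingerprint set into a fixed-size MinHash
    signature using random affine hash functions over a prime field."""
    rng = random.Random(seed)
    p = (1 << 61) - 1  # a Mersenne prime
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [min((a * h + b) % p for h in fingerprints) for a, b in params]

def estimated_jaccard(sig1: List[int], sig2: List[int]) -> float:
    """Fraction of matching signature slots; an unbiased Jaccard estimate."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```

Signatures are small and comparable in constant time, so candidate lookup over millions of files becomes an index query rather than a full scan.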

Project Martial positions itself as a bridge between academic research on code similarity and the day‑to‑day needs of developers, educators, and legal teams—making the detection of software plagiarism both more accurate and more actionable.

Authors

  • Rares Folea
  • Emil Slusanschi

Paper Information

  • arXiv ID: 2601.00429v1
  • Categories: cs.SE
  • Published: January 1, 2026