[Paper] A Two-Stage, Object-Centric Deep Learning Framework for Robust Exam Cheating Detection

Published: April 17, 2026

Source: arXiv - 2604.16234v1

Overview

The paper presents a lightweight, two‑stage deep‑learning pipeline that can spot cheating behavior in exam rooms from video frames. By pairing a fast object detector (YOLOv8n) with a specialized behavior classifier (RexNet‑150), the authors achieve near‑human accuracy while keeping inference latency low enough for real‑time, large‑scale deployments.

Key Contributions

  • Simple two‑stage architecture: Detects each student with YOLOv8n, then classifies the cropped region as “normal” or “cheating” using a fine‑tuned RexNet‑150 model.
  • Large, heterogeneous dataset: more than 273,000 samples gathered from 10 independent sources, providing diverse lighting, camera angles, and classroom layouts.
  • High performance: 0.95 accuracy, 0.94 recall, 0.96 precision, and 0.95 F1‑score, a gain of about 13 percentage points over a strong video‑based baseline.
  • Real‑time inference: Average processing time of 13.9 ms per frame, enabling live monitoring of thousands of seats simultaneously.
  • Ethical delivery mechanism: Results are sent privately to each student (e.g., via email) to avoid public shaming and give individuals a chance to reflect.
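The decoupled two‑stage design can be sketched in a few lines of plain Python. This is a minimal illustration of the wiring only, not the authors' code: the stub detector and classifier below are hypothetical stand‑ins for YOLOv8n and RexNet‑150, and exist just to show how either stage can be swapped independently.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A bounding box as (x1, y1, x2, y2) pixel coordinates.
Box = Tuple[int, int, int, int]

@dataclass
class TwoStagePipeline:
    """Decoupled detector/classifier wiring: either stage can be
    replaced (e.g., a newer detector) without touching the other."""
    detect: Callable[[object], List[Box]]    # Stage 1: person boxes per frame
    classify: Callable[[object, Box], str]   # Stage 2: label for one crop

    def run(self, frame) -> List[Tuple[Box, str]]:
        # Classify every detected student region in the frame.
        return [(box, self.classify(frame, box)) for box in self.detect(frame)]

# Hypothetical stub stages standing in for YOLOv8n and RexNet-150.
def stub_detect(frame) -> List[Box]:
    return [(0, 0, 10, 10), (20, 0, 30, 10)]

def stub_classify(frame, box: Box) -> str:
    return "cheating" if box[0] >= 20 else "normal"

pipeline = TwoStagePipeline(detect=stub_detect, classify=stub_classify)
results = pipeline.run(None)
print(results)  # [((0, 0, 10, 10), 'normal'), ((20, 0, 30, 10), 'cheating')]
```

Because the two stages only communicate through bounding boxes, upgrading the detector or the classifier never requires retraining the other half of the pipeline.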

Methodology

  1. Stage 1 – Student Localization

    • Uses YOLOv8n (the “nano” version of YOLOv8) for its balance of speed and detection quality.
    • The model outputs bounding boxes around every person in an exam‑room image.
  2. Stage 2 – Behavior Classification

    • Each bounding box is cropped, resized, and normalized.
    • A RexNet‑150 network—pre‑trained on ImageNet and then fine‑tuned on the cheating dataset—classifies the crop as normal or cheating.
    • The two stages are decoupled, so improvements to either detector or classifier can be swapped in without retraining the whole pipeline.
  3. Training & Validation

    • The combined dataset is split into training/validation/test sets, preserving source diversity.
    • Standard augmentations (random flips, brightness jitter) help the classifier generalize to unseen camera conditions.
  4. Deployment Considerations

    • The pipeline runs on commodity GPUs or even high‑end CPUs, thanks to the low‑parameter YOLOv8n and the efficient RexNet‑150 backbone.
    • Inference is performed per frame; temporal smoothing (e.g., majority vote over consecutive frames) is left as a future enhancement.
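The per‑frame majority‑vote smoothing the authors leave as future work could look like the following sketch. This is my own illustration, not from the paper; the sliding‑window size `k` and the per‑student keying are assumptions.

```python
from collections import Counter, defaultdict, deque

class MajorityVoteSmoother:
    """Keep a sliding window of per-frame labels for each student and
    report the majority label, suppressing one-frame false positives."""
    def __init__(self, k: int = 5):
        self.k = k
        # Each student id maps to a window of the last k frame labels.
        self.history = defaultdict(lambda: deque(maxlen=k))

    def update(self, student_id: int, label: str) -> str:
        window = self.history[student_id]
        window.append(label)
        # Counter preserves insertion order on ties, so earlier labels win.
        return Counter(window).most_common(1)[0][0]

smoother = MajorityVoteSmoother(k=5)
frame_labels = ["normal", "cheating", "normal", "cheating", "cheating"]
smoothed = [smoother.update(7, lab) for lab in frame_labels]
print(smoothed[-1])  # 'cheating' only once it dominates the window
```

A single "cheating" frame surrounded by "normal" frames is absorbed by the window; the flag is raised only once the label persists.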

Results & Findings

| Metric | Value |
| --- | --- |
| Accuracy | 0.95 |
| Recall (cheating detection) | 0.94 |
| Precision (cheating detection) | 0.96 |
| F1‑score | 0.95 |
| Avg. inference time | 13.9 ms / sample |
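The reported numbers are internally consistent: the F1‑score is the harmonic mean of precision and recall, which can be checked in one line.

```python
def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

score = f1(precision=0.96, recall=0.94)
print(round(score, 2))  # 0.95, matching the reported F1-score
```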
  • The system outperforms a strong video‑based baseline (0.82 accuracy) by 13 percentage points, confirming that a well‑designed two‑stage image pipeline can rival more complex multi‑modal solutions.
  • High precision means false accusations are rare, a crucial factor for maintaining trust in academic settings.
  • The low latency demonstrates feasibility for real‑time monitoring of large lecture halls or remote proctoring setups.
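A back‑of‑envelope throughput estimate from the reported latency, assuming sequential per‑sample processing on a single device (parallelism across GPUs would scale this further):

```python
latency_ms = 13.9              # average per-sample inference time from the paper
fps = 1000 / latency_ms        # samples (student crops) processed per second
print(round(fps, 1))  # roughly 72 samples/second per device
```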

Practical Implications

  • Scalable Proctoring: Universities can deploy the model on existing surveillance infrastructure without needing expensive dedicated hardware.
  • Cost Reduction: Automating detection cuts down on human invigilator hours and reduces the need for manual video review.
  • Integration Friendly: Because the stages are modular, developers can replace YOLOv8n with a newer detector or swap RexNet‑150 for a transformer‑based classifier as they become available.
  • Privacy‑First Workflow: The private‑by‑design reporting mechanism aligns with GDPR‑like regulations, making it easier for institutions to adopt AI‑assisted monitoring without legal pushback.
  • Open‑Source Foundations: The authors release code and model weights, enabling rapid prototyping and community‑driven improvements (e.g., adding audio cues or multi‑frame temporal analysis).

Limitations & Future Work

  • Single‑frame focus: The current system ignores temporal cues that could catch subtle cheating patterns spread across frames.
  • Audio exclusion: No sound analysis; integrating microphone data could detect whispering or device usage.
  • Dataset bias: Although diverse, the dataset may still under‑represent certain classroom layouts or cultural contexts, potentially affecting generalization.
  • Explainability: The black‑box nature of deep classifiers isn’t addressed; future work could add visual explanations (e.g., Grad‑CAM) to help invigilators understand why a flag was raised.

By tackling these points, the framework could evolve into a comprehensive, multimodal exam integrity platform that balances accuracy, speed, and ethical responsibility.

Authors

  • Van-Truong Le
  • Le-Khanh Nguyen
  • Trong-Doanh Nguyen

Paper Information

  • arXiv ID: 2604.16234v1
  • Categories: cs.CV, cs.AI
  • Published: April 17, 2026
  • PDF: Download PDF