[Paper] A Two-Stage, Object-Centric Deep Learning Framework for Robust Exam Cheating Detection
Source: arXiv - 2604.16234v1
Overview
The paper presents a lightweight, two‑stage deep‑learning pipeline that can spot cheating behavior in exam rooms from video frames. By pairing a fast object detector (YOLOv8n) with a specialized behavior classifier (RexNet‑150), the authors achieve near‑human accuracy while keeping inference latency low enough for real‑time, large‑scale deployments.
Key Contributions
- Simple two‑stage architecture: Detects each student with YOLOv8n, then classifies the cropped region as “normal” or “cheating” using a fine‑tuned RexNet‑150 model.
- Large, heterogeneous dataset: more than 273,000 samples gathered from 10 independent sources, providing diverse lighting, camera angles, and classroom layouts.
- High performance: 0.95 accuracy, 0.94 recall, 0.96 precision, and 0.95 F1‑score—a gain of roughly 13 percentage points in accuracy over a strong video‑based baseline.
- Real‑time inference: Average processing time of 13.9 ms per frame, enabling live monitoring of thousands of seats simultaneously.
- Ethical delivery mechanism: Results are sent privately to each student (e.g., via email) to avoid public shaming and give individuals a chance to reflect.
Methodology
Stage 1 – Student Localization
- Uses YOLOv8n (the “nano” version of YOLOv8) for its balance of speed and detection quality.
- The model outputs bounding boxes around every person in an exam‑room image.
Stage 2 – Behavior Classification
- Each bounding box is cropped, resized, and normalized.
- A RexNet‑150 network—pre‑trained on ImageNet and then fine‑tuned on the cheating dataset—classifies the crop as normal or cheating.
- The two stages are decoupled, so improvements to either detector or classifier can be swapped in without retraining the whole pipeline.
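The decoupling described above can be illustrated with a minimal sketch. The detector and classifier below are hypothetical stand‑ins (the paper's actual models are YOLOv8n and RexNet‑150); the point is that each stage hides behind a narrow interface, so either can be swapped without touching the other.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) bounding box

@dataclass
class TwoStagePipeline:
    """Decoupled detector + classifier, mirroring the paper's two-stage design."""
    detect: Callable[[object], List[Box]]    # Stage 1: localize each student
    classify: Callable[[object, Box], str]   # Stage 2: label each cropped region

    def run(self, frame: object) -> List[Tuple[Box, str]]:
        # Stage 1 finds every person; Stage 2 labels each crop independently.
        return [(box, self.classify(frame, box)) for box in self.detect(frame)]

# Stub stages for illustration only (not the paper's models):
def stub_detector(frame):
    return [(0, 0, 64, 64), (64, 0, 128, 64)]

def stub_classifier(frame, box):
    return "normal" if box[0] == 0 else "cheating"

pipeline = TwoStagePipeline(detect=stub_detector, classify=stub_classifier)
results = pipeline.run(frame=None)
print(results)  # [((0, 0, 64, 64), 'normal'), ((64, 0, 128, 64), 'cheating')]
```

Replacing `stub_detector` with a newer detector, or `stub_classifier` with a transformer‑based model, requires no retraining of the other stage.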
Training & Validation
- The combined dataset is split into training/validation/test sets, preserving source diversity.
- Standard augmentations (random flips, brightness jitter) help the classifier generalize to unseen camera conditions.
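As a toy illustration of the augmentations mentioned (not the authors' training code), a horizontal flip and brightness jitter on a grayscale image represented as a nested list might look like:

```python
import random

def horizontal_flip(img):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in img]

def brightness_jitter(img, max_delta=30, seed=0):
    """Shift every pixel by one random offset, clamped to [0, 255]."""
    delta = random.Random(seed).randint(-max_delta, max_delta)
    return [[min(255, max(0, p + delta)) for p in row] for row in img]

img = [[10, 200], [30, 40]]
print(horizontal_flip(img))  # [[200, 10], [40, 30]]
```

In practice a library such as torchvision would supply these transforms, but the effect is the same: the classifier sees slightly varied crops and generalizes better to unseen cameras.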
Deployment Considerations
- The pipeline runs on commodity GPUs or even high‑end CPUs, thanks to the low‑parameter YOLOv8n and the efficient RexNet‑150 backbone.
- Inference is performed per frame; temporal smoothing (e.g., majority vote over consecutive frames) is left as a future enhancement.
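A simple version of the majority‑vote smoothing the authors leave as future work (a sketch, not part of the released pipeline) could keep a sliding window of per‑frame labels:

```python
from collections import Counter, deque

class MajorityVoteSmoother:
    """Emit the label that wins a majority of the last `window` frames."""
    def __init__(self, window: int = 5):
        self.labels = deque(maxlen=window)

    def update(self, label: str) -> str:
        self.labels.append(label)
        # The most common label in the window suppresses single-frame noise.
        return Counter(self.labels).most_common(1)[0][0]

smoother = MajorityVoteSmoother(window=3)
stream = ["normal", "cheating", "normal", "cheating", "cheating"]
print([smoother.update(x) for x in stream])
```

A single spurious "cheating" frame is absorbed by the window; only a sustained run of flags survives, which should further reduce false accusations.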
Results & Findings
| Metric | Value |
|---|---|
| Accuracy | 0.95 |
| Recall (cheating detection) | 0.94 |
| Precision (cheating detection) | 0.96 |
| F1‑Score | 0.95 |
| Avg. inference time | 13.9 ms / sample |
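As a quick consistency check using only the values reported above, the F1‑score follows from the reported precision and recall, and the reported latency translates into a per‑stream frame rate:

```python
# Values taken directly from the paper's results table.
precision, recall = 0.96, 0.94
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.95, matching the reported F1-score

latency_ms = 13.9
fps = 1000 / latency_ms
print(round(fps, 1))  # roughly 72 frames per second per stream
```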
- The system outperforms a strong video‑based baseline (0.82 accuracy) by 13 percentage points, confirming that a well‑designed two‑stage image pipeline can rival more complex multi‑modal solutions.
- High precision means false accusations are rare, a crucial factor for maintaining trust in academic settings.
- The low latency demonstrates feasibility for real‑time monitoring of large lecture halls or remote proctoring setups.
Practical Implications
- Scalable Proctoring: Universities can deploy the model on existing surveillance infrastructure without needing expensive dedicated hardware.
- Cost Reduction: Automating detection cuts down on human invigilator hours and reduces the need for manual video review.
- Integration Friendly: Because the stages are modular, developers can replace YOLOv8n with a newer detector or swap RexNet‑150 for a transformer‑based classifier as they become available.
- Privacy‑First Workflow: The private‑by‑design reporting mechanism aligns with GDPR‑like regulations, making it easier for institutions to adopt AI‑assisted monitoring without legal pushback.
- Open‑Source Foundations: The authors release code and model weights, enabling rapid prototyping and community‑driven improvements (e.g., adding audio cues or multi‑frame temporal analysis).
Limitations & Future Work
- Single‑frame focus: The current system ignores temporal cues that could catch subtle cheating patterns spread across frames.
- Audio exclusion: No sound analysis; integrating microphone data could detect whispering or device usage.
- Dataset bias: Although diverse, the dataset may still under‑represent certain classroom layouts or cultural contexts, potentially affecting generalization.
- Explainability: The black‑box nature of deep classifiers isn’t addressed; future work could add visual explanations (e.g., Grad‑CAM) to help invigilators understand why a flag was raised.
By tackling these points, the framework could evolve into a comprehensive, multimodal exam integrity platform that balances accuracy, speed, and ethical responsibility.
Authors
- Van-Truong Le
- Le-Khanh Nguyen
- Trong-Doanh Nguyen
Paper Information
- arXiv ID: 2604.16234v1
- Categories: cs.CV, cs.AI
- Published: April 17, 2026