[Paper] A Two-Stage, Object-Centric Deep Learning Framework for Robust Exam Cheating Detection
Source: arXiv - 2604.16234v1
Overview
The paper presents a lightweight, two‑stage deep‑learning pipeline that can spot cheating behavior in exam rooms from video frames. By pairing a fast object detector (YOLOv8n) with a specialized behavior classifier (RexNet‑150), the authors achieve near‑human accuracy while keeping inference latency low enough for real‑time, large‑scale deployments.
Key Contributions
- Simple two‑stage architecture: Detects each student with YOLOv8n, then classifies the cropped region as “normal” or “cheating” using a fine‑tuned RexNet‑150 model.
- Large, heterogeneous dataset: more than 273,000 samples gathered from 10 independent sources, providing diverse lighting, camera angles, and classroom layouts.
- High performance: 0.95 accuracy, 0.94 recall, 0.96 precision, and 0.95 F1‑score—a gain of roughly 13 percentage points in accuracy over a strong video‑based baseline.
- Real‑time inference: Average processing time of 13.9 ms per frame, enabling live monitoring of thousands of seats simultaneously.
- Ethical delivery mechanism: Results are sent privately to each student (e.g., via email) to avoid public shaming and give individuals a chance to reflect.
Methodology
Stage 1 – Student Localization
- Uses YOLOv8n (the “nano” version of YOLOv8) for its balance of speed and detection quality.
- The model outputs bounding boxes around every person in an exam‑room image.
Stage 2 – Behavior Classification
- Each bounding box is cropped, resized, and normalized.
- A RexNet‑150 network—pre‑trained on ImageNet and then fine‑tuned on the cheating dataset—classifies the crop as normal or cheating.
- The two stages are decoupled, so improvements to either detector or classifier can be swapped in without retraining the whole pipeline.
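The decoupling described above can be illustrated with a minimal sketch. The detector and classifier below are hypothetical stand‑ins (the paper's actual models are YOLOv8n and RexNet‑150); the point is that each stage hides behind a narrow interface, so either can be swapped without touching the other.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) bounding box

@dataclass
class TwoStagePipeline:
    """Decoupled detector + classifier, mirroring the paper's two-stage design."""
    detect: Callable[[object], List[Box]]    # Stage 1: localize each student
    classify: Callable[[object, Box], str]   # Stage 2: label each cropped region

    def run(self, frame: object) -> List[Tuple[Box, str]]:
        # Stage 1 finds every person; Stage 2 labels each crop independently.
        return [(box, self.classify(frame, box)) for box in self.detect(frame)]

# Stub stages for illustration only (not the paper's models):
def stub_detector(frame):
    return [(0, 0, 64, 64), (64, 0, 128, 64)]

def stub_classifier(frame, box):
    return "normal" if box[0] == 0 else "cheating"

pipeline = TwoStagePipeline(detect=stub_detector, classify=stub_classifier)
results = pipeline.run(frame=None)
print(results)  # [((0, 0, 64, 64), 'normal'), ((64, 0, 128, 64), 'cheating')]
```

Replacing `stub_detector` with a newer detector, or `stub_classifier` with a transformer‑based model, requires no retraining of the other stage.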
Training & Validation
- The combined dataset is split into training/validation/test sets, preserving source diversity.
- Standard augmentations (random flips, brightness jitter) help the classifier generalize to unseen camera conditions.
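As a toy illustration of the augmentations mentioned (not the authors' training code), a horizontal flip and brightness jitter on a grayscale image represented as a nested list might look like:

```python
import random

def horizontal_flip(img):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in img]

def brightness_jitter(img, max_delta=30, seed=0):
    """Shift every pixel by one random offset, clamped to [0, 255]."""
    delta = random.Random(seed).randint(-max_delta, max_delta)
    return [[min(255, max(0, p + delta)) for p in row] for row in img]

img = [[10, 200], [30, 40]]
print(horizontal_flip(img))  # [[200, 10], [40, 30]]
```

In practice a library such as torchvision would supply these transforms, but the effect is the same: the classifier sees slightly varied crops and generalizes better to unseen cameras.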
Deployment Considerations
- The pipeline runs on commodity GPUs or even high‑end CPUs, thanks to the low‑parameter YOLOv8n and the efficient RexNet‑150 backbone.
- Inference is performed per frame; temporal smoothing (e.g., majority vote over consecutive frames) is left as a future enhancement.
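A simple version of the majority‑vote smoothing the authors leave as future work (a sketch, not part of the released pipeline) could keep a sliding window of per‑frame labels:

```python
from collections import Counter, deque

class MajorityVoteSmoother:
    """Emit the label that wins a majority of the last `window` frames."""
    def __init__(self, window: int = 5):
        self.labels = deque(maxlen=window)

    def update(self, label: str) -> str:
        self.labels.append(label)
        # The most common label in the window suppresses single-frame noise.
        return Counter(self.labels).most_common(1)[0][0]

smoother = MajorityVoteSmoother(window=3)
stream = ["normal", "cheating", "normal", "cheating", "cheating"]
print([smoother.update(x) for x in stream])
```

A single spurious "cheating" frame is absorbed by the window; only a sustained run of flags survives, which should further reduce false accusations.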
Results & Findings
| Metric | Value |
|---|---|
| Accuracy | 0.95 |
| Recall (cheating detection) | 0.94 |
| Precision (cheating detection) | 0.96 |
| F1‑Score | 0.95 |
| Avg. inference time | 13.9 ms / sample |
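As a quick consistency check using only the values reported above, the F1‑score follows from the reported precision and recall, and the reported latency translates into a per‑stream frame rate:

```python
# Values taken directly from the paper's results table.
precision, recall = 0.96, 0.94
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.95, matching the reported F1-score

latency_ms = 13.9
fps = 1000 / latency_ms
print(round(fps, 1))  # roughly 72 frames per second per stream
```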
- The system outperforms a strong video‑based baseline (0.82 accuracy) by 13 percentage points, confirming that a well‑designed two‑stage image pipeline can rival more complex multi‑modal solutions.
- High precision means false accusations are rare, a crucial factor for maintaining trust in academic settings.
- The low latency demonstrates feasibility for real‑time monitoring of large lecture halls or remote proctoring setups.
Practical Implications
- Scalable Proctoring: Universities can deploy the model on existing surveillance infrastructure without needing expensive dedicated hardware.
- Cost Reduction: Automating detection cuts down on human invigilator hours and reduces the need for manual video review.
- Integration Friendly: Because the stages are modular, developers can replace YOLOv8n with a newer detector or swap RexNet‑150 for a transformer‑based classifier as they become available.
- Privacy‑First Workflow: The private‑by‑design reporting mechanism aligns with GDPR‑like regulations, making it easier for institutions to adopt AI‑assisted monitoring without legal pushback.
- Open‑Source Foundations: The authors release code and model weights, enabling rapid prototyping and community‑driven improvements (e.g., adding audio cues or multi‑frame temporal analysis).
Limitations & Future Work
- Single‑frame focus: The current system ignores temporal cues that could catch subtle cheating patterns spread across frames.
- Audio exclusion: No sound analysis; integrating microphone data could detect whispering or device usage.
- Dataset bias: Although diverse, the dataset may still under‑represent certain classroom layouts or cultural contexts, potentially affecting generalization.
- Explainability: The black‑box nature of deep classifiers isn’t addressed; future work could add visual explanations (e.g., Grad‑CAM) to help invigilators understand why a flag was raised.
By tackling these points, the framework could evolve into a comprehensive, multimodal exam integrity platform that balances accuracy, speed, and ethical responsibility.
Authors
- Van-Truong Le
- Le-Khanh Nguyen
- Trong-Doanh Nguyen
Paper Information
- arXiv ID: 2604.16234v1
- Categories: cs.CV, cs.AI
- Published: April 17, 2026