[Paper] Grading Handwritten Engineering Exams with Multimodal Large Language Models

Published: January 2, 2026 at 11:10 AM EST
4 min read

Source: arXiv - 2601.00730v1

Overview

Grading handwritten engineering exams has long been a bottleneck: students’ free‑form sketches, equations, and circuit diagrams are hard for computers to interpret, and manual marking doesn’t scale. The new paper introduces an end‑to‑end workflow that leverages multimodal large language models (LLMs) to automatically grade scanned handwritten quizzes while preserving the traditional paper‑based exam format. By requiring only a handwritten reference solution and a concise rule set from the lecturer, the system can produce reliable, auditable grades with minimal human oversight.

Key Contributions

  • Fully multimodal grading pipeline that accepts raw A4 scans (handwriting, drawings, schematics) and outputs machine‑parseable grade reports.
  • Reference‑grounded prompting: the lecturer’s handwritten solution is transformed into a text summary that conditions the LLM without exposing the original scan, ensuring privacy and reproducibility.
  • Robust multi‑stage design: includes a format/presence check, an ensemble of independent graders, a supervisor‑level aggregation step, and deterministic templates that guarantee auditability.
  • Empirical evaluation on a real Slovenian engineering quiz (including hand‑drawn circuit diagrams) showing an average absolute grading error of ~8 points on a 40‑point scale.
  • Ablation study demonstrating that naïve prompting or omitting the reference solution dramatically worsens accuracy and introduces systematic over‑grading.

Methodology

  1. Scanning & Pre‑processing – Students’ answer sheets are digitized as high‑resolution images. A lightweight OCR/vision model extracts text blocks and detects hand‑drawn elements (e.g., circuit symbols).
  2. Reference Summarization – The lecturer provides a handwritten “perfect” answer. A separate multimodal LLM converts this scan into a concise textual summary (the reference prompt).
  3. Grading Prompt Construction – For each student answer, the system builds a structured prompt that includes:
    • The extracted text and diagram descriptors.
    • The grading rubric supplied by the lecturer.
    • The reference summary (as conditioning information).
  4. Ensemble Grading – Multiple independent LLM instances (e.g., GPT‑5.2, Gemini‑3 Pro) evaluate the same prompt, each producing a raw score and a justification.
  5. Supervisor Aggregation – A higher‑level model reconciles the ensemble outputs, applies deterministic validation rules (e.g., “the score must be an integer between 0 and 40”), and flags ambiguous cases for human review; a minimal sketch of steps 3-5 follows this list.
  6. Report Generation – Final grades and rationales are emitted in a fixed JSON schema, enabling downstream analytics and audit trails (an illustrative report is shown at the end of this section).
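
To make steps 3-5 concrete, here is a minimal Python sketch of prompt construction, ensemble grading, and supervisor aggregation. The model names, the `call_model()` helper, and the `D_MAX` disagreement threshold are illustrative assumptions; the paper's supervisor is itself an LLM that reconciles the graders' justifications, which the median-and-threshold rule below only approximates.

```python
import json
import statistics

GRADER_MODELS = ["grader-model-a", "grader-model-b", "grader-model-c"]  # placeholder names
D_MAX = 10  # assumed ensemble-disagreement threshold that triggers human review

def build_grading_prompt(student_text, diagram_descriptors, rubric, reference_summary):
    """Step 3: assemble a structured grading prompt for one student answer."""
    return (
        "You are grading a handwritten engineering answer.\n"
        f"Rubric:\n{rubric}\n\n"
        f"Reference solution (text summary):\n{reference_summary}\n\n"
        f"Student answer (transcribed):\n{student_text}\n\n"
        f"Detected diagram elements: {', '.join(diagram_descriptors)}\n\n"
        "Return JSON with an integer 'score' (0-40) and a short 'justification'."
    )

def call_model(model_name, prompt):
    """Placeholder for a real LLM API call; wire up your provider here."""
    raise NotImplementedError

def ensemble_grade(prompt):
    """Step 4: gather independent scores and justifications from each grader."""
    results = []
    for model in GRADER_MODELS:
        raw = call_model(model, prompt)
        results.append(json.loads(raw))  # expects {"score": int, "justification": str}
    return results

def supervise(results):
    """Step 5: deterministic validation plus flagging of ambiguous cases."""
    scores = [r["score"] for r in results]
    # Deterministic rule mentioned in the paper: scores must be integers in 0-40.
    if any(not isinstance(s, int) or not 0 <= s <= 40 for s in scores):
        return {"status": "needs_review", "reason": "invalid score format"}
    # Assumed rule: flag large disagreement among the ensemble graders.
    if max(scores) - min(scores) > D_MAX:
        return {"status": "needs_review", "reason": "graders disagree"}
    return {"status": "auto_graded", "score": round(statistics.median(scores))}
```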

The entire pipeline is “frozen” during evaluation: no fine‑tuning or parameter updates are performed, which mirrors a realistic deployment scenario.
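
As an illustration of the step 6 output, a single-answer grade report might look as follows. The article only states that the schema is fixed JSON, so the field names here are assumptions rather than the paper's actual schema.

```python
import json

# Illustrative grade report for step 6; field names are assumptions.
report = {
    "student_id": "anonymized-0421",   # assumed identifier field
    "question": 3,
    "final_score": 31,                 # integer in 0-40, per the validation rule
    "ensemble_scores": [30, 31, 33],
    "needs_human_review": False,
    "justification": "Loop equation set up correctly; sign error in the final current value.",
}

print(json.dumps(report, indent=2))
```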

Results & Findings

  • Mean Absolute Difference (MAD): ≈ 8 points on a 40‑point exam (≈ 20 % error), with negligible systematic bias (average over‑/under‑grading < 0.5 points); a sketch of how these metrics are computed follows this list.
  • Manual‑Review Trigger Rate: Only ~17 % of submissions required human intervention under a strict maximum‑difference threshold (Dₘₐₓ = 40).
  • Ablation Insights:
    • Removing the reference summary increased MAD to > 15 points and introduced a consistent +3‑point over‑grade bias.
    • Simplifying prompts to a single LLM call (no ensemble) raised error variance and doubled the review trigger rate.
  • Diagram Handling: The vision component successfully identified key circuit symbols, allowing the LLM to reason about diagram correctness at a level comparable to that of a human grader.
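
The headline metrics reduce to simple aggregates over pairs of automatic and human scores. The sketch below, with made-up numbers, shows how the MAD and bias figures are computed; the review-trigger rule is already covered in the aggregation sketch above.

```python
# Sketch of the reported metrics: mean absolute difference (MAD) and systematic
# bias between automatic and human scores. The example arrays are placeholders,
# not data from the paper.

def mad_and_bias(auto_scores, human_scores):
    diffs = [a - h for a, h in zip(auto_scores, human_scores)]
    mad = sum(abs(d) for d in diffs) / len(diffs)   # paper reports ≈ 8 on a 40-point scale
    bias = sum(diffs) / len(diffs)                  # paper reports |bias| < 0.5 points
    return mad, bias

print(mad_and_bias([30, 24, 38, 15], [32, 20, 36, 22]))  # -> (3.75, -0.75)
```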

Practical Implications

  • Scalable Assessment – Universities and training providers can automate grading for large cohorts without redesigning exams, preserving the familiar pen‑and‑paper workflow.
  • Rapid Feedback Loops – Automated scores become available within minutes of scanning, enabling timely student feedback and adaptive learning pathways.
  • Auditability & Transparency – Deterministic templates and JSON reports make it easy to trace each grade back to the underlying LLM reasoning, satisfying accreditation requirements.
  • Cost Reduction – With only ~17 % of answers needing manual review, institutions can cut grading labor by up to 80 % for open‑ended STEM assessments.
  • Extensibility – The same pipeline can be adapted to other domains (e.g., physics problem sets, architectural sketches) by swapping the rubric and adjusting the vision preprocessing for domain‑specific symbols.

Limitations & Future Work

  • Language & Domain Specificity – The current evaluation is on a Slovenian engineering quiz; performance on other languages or highly specialized engineering sub‑fields remains to be validated.
  • Diagram Complexity – While simple circuit schematics are handled well, more intricate drawings (e.g., multi‑layer PCB layouts) may exceed the current vision module’s capabilities.
  • Model Access – The pipeline relies on proprietary LLM APIs (GPT‑5.2, Gemini‑3 Pro); reproducibility could be limited for organizations without commercial access.
  • Human‑in‑the‑Loop Optimization – Future work could explore active learning strategies where the system selectively queries human graders to improve its prompts over time.

Bottom line: By marrying multimodal LLMs with a rigorously engineered grading workflow, this research demonstrates a viable path toward automated, trustworthy assessment of handwritten engineering exams—opening the door for broader adoption of AI‑assisted education at scale.

Authors

  • Janez Perš
  • Jon Muhovič
  • Andrej Košir
  • Boštjan Murovec

Paper Information

  • arXiv ID: 2601.00730v1
  • Categories: cs.CV
  • Published: January 2, 2026