[Paper] Grading Handwritten Engineering Exams with Multimodal Large Language Models

Published: January 2, 2026 at 11:10 AM EST
4 min read

Source: arXiv - 2601.00730v1

Overview

Grading handwritten engineering exams has long been a bottleneck: students’ free‑form sketches, equations, and circuit diagrams are hard for computers to interpret, and manual marking doesn’t scale. The new paper introduces an end‑to‑end workflow that leverages multimodal large language models (LLMs) to automatically grade scanned handwritten quizzes while preserving the traditional paper‑based exam format. By requiring only a handwritten reference solution and a concise rule set from the lecturer, the system can produce reliable, auditable grades with minimal human oversight.

Key Contributions

  • Fully multimodal grading pipeline that accepts raw A4 scans (handwriting, drawings, schematics) and outputs machine‑parseable grade reports.
  • Reference‑grounded prompting: the lecturer’s handwritten solution is transformed into a text summary that conditions the LLM without exposing the original scan, ensuring privacy and reproducibility.
  • Robust multi‑stage design: includes a format/presence check, an ensemble of independent graders, a supervisor‑level aggregation step, and deterministic templates that guarantee auditability.
  • Empirical evaluation on a real Slovenian engineering quiz (including hand‑drawn circuit diagrams) showing an average absolute grading error of ~8 points on a 40‑point scale.
  • Ablation study demonstrating that naïve prompting or omitting the reference solution dramatically worsens accuracy and introduces systematic over‑grading.

Methodology

  1. Scanning & Pre‑processing – Students’ answer sheets are digitized as high‑resolution images. A lightweight OCR/vision model extracts text blocks and detects hand‑drawn elements (e.g., circuit symbols).
  2. Reference Summarization – The lecturer provides a handwritten “perfect” answer. A separate multimodal LLM converts this scan into a concise textual summary (the reference prompt).
  3. Grading Prompt Construction – For each student answer, the system builds a structured prompt that includes:
    • The extracted text and diagram descriptors.
    • The grading rubric supplied by the lecturer.
    • The reference summary (as conditioning information).
  4. Ensemble Grading – Multiple independent LLM instances (e.g., GPT‑5.2, Gemini‑3 Pro) evaluate the same prompt, each producing a raw score and a justification.
  5. Supervisor Aggregation – A higher‑level model reconciles the ensemble outputs, applies deterministic validation rules (e.g., “the score must be an integer between 0 and 40”), and flags ambiguous cases for human review; a minimal sketch of steps 3-5 follows this list.
  6. Report Generation – Final grades and rationales are emitted in a fixed JSON schema, enabling downstream analytics and audit trails (an illustrative report is shown at the end of this section).
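
To make steps 3-5 concrete, here is a minimal Python sketch of prompt construction, ensemble grading, and supervisor aggregation. The model names, the `call_model()` helper, and the `D_MAX` disagreement threshold are illustrative assumptions; the paper's supervisor is itself an LLM that reconciles the graders' justifications, which the median-and-threshold rule below only approximates.

```python
import json
import statistics

GRADER_MODELS = ["grader-model-a", "grader-model-b", "grader-model-c"]  # placeholder names
D_MAX = 10  # assumed ensemble-disagreement threshold that triggers human review

def build_grading_prompt(student_text, diagram_descriptors, rubric, reference_summary):
    """Step 3: assemble a structured grading prompt for one student answer."""
    return (
        "You are grading a handwritten engineering answer.\n"
        f"Rubric:\n{rubric}\n\n"
        f"Reference solution (text summary):\n{reference_summary}\n\n"
        f"Student answer (transcribed):\n{student_text}\n\n"
        f"Detected diagram elements: {', '.join(diagram_descriptors)}\n\n"
        "Return JSON with an integer 'score' (0-40) and a short 'justification'."
    )

def call_model(model_name, prompt):
    """Placeholder for a real LLM API call; wire up your provider here."""
    raise NotImplementedError

def ensemble_grade(prompt):
    """Step 4: gather independent scores and justifications from each grader."""
    results = []
    for model in GRADER_MODELS:
        raw = call_model(model, prompt)
        results.append(json.loads(raw))  # expects {"score": int, "justification": str}
    return results

def supervise(results):
    """Step 5: deterministic validation plus flagging of ambiguous cases."""
    scores = [r["score"] for r in results]
    # Deterministic rule mentioned in the paper: scores must be integers in 0-40.
    if any(not isinstance(s, int) or not 0 <= s <= 40 for s in scores):
        return {"status": "needs_review", "reason": "invalid score format"}
    # Assumed rule: flag large disagreement among the ensemble graders.
    if max(scores) - min(scores) > D_MAX:
        return {"status": "needs_review", "reason": "graders disagree"}
    return {"status": "auto_graded", "score": round(statistics.median(scores))}
```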

The entire pipeline is “frozen” during evaluation: no fine‑tuning or parameter updates are performed, which mirrors a realistic deployment scenario.
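
As an illustration of the step 6 output, a single-answer grade report might look as follows. The article only states that the schema is fixed JSON, so the field names here are assumptions rather than the paper's actual schema.

```python
import json

# Illustrative grade report for step 6; field names are assumptions.
report = {
    "student_id": "anonymized-0421",   # assumed identifier field
    "question": 3,
    "final_score": 31,                 # integer in 0-40, per the validation rule
    "ensemble_scores": [30, 31, 33],
    "needs_human_review": False,
    "justification": "Loop equation set up correctly; sign error in the final current value.",
}

print(json.dumps(report, indent=2))
```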

Results & Findings

  • Mean Absolute Difference (MAD): ≈ 8 points on a 40‑point exam (≈ 20 % error), with negligible systematic bias (average over‑/under‑grading < 0.5 points); a sketch of how these metrics are computed follows this list.
  • Manual‑Review Trigger Rate: Only ~17 % of submissions required human intervention under a strict maximum‑difference threshold (Dₘₐₓ = 40).
  • Ablation Insights:
    • Removing the reference summary increased MAD to > 15 points and introduced a consistent +3‑point over‑grade bias.
    • Simplifying prompts to a single LLM call (no ensemble) raised error variance and doubled the review trigger rate.
  • Diagram Handling: The vision component successfully identified key circuit symbols, allowing the LLM to reason about diagram correctness at a level comparable to that of a human grader.
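
The headline metrics reduce to simple aggregates over pairs of automatic and human scores. The sketch below, with made-up numbers, shows how the MAD and bias figures are computed; the review-trigger rule is already covered in the aggregation sketch above.

```python
# Sketch of the reported metrics: mean absolute difference (MAD) and systematic
# bias between automatic and human scores. The example arrays are placeholders,
# not data from the paper.

def mad_and_bias(auto_scores, human_scores):
    diffs = [a - h for a, h in zip(auto_scores, human_scores)]
    mad = sum(abs(d) for d in diffs) / len(diffs)   # paper reports ≈ 8 on a 40-point scale
    bias = sum(diffs) / len(diffs)                  # paper reports |bias| < 0.5 points
    return mad, bias

print(mad_and_bias([30, 24, 38, 15], [32, 20, 36, 22]))  # -> (3.75, -0.75)
```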

Practical Implications

  • Scalable Assessment – Universities and training providers can automate grading for large cohorts without redesigning exams, preserving the familiar pen‑and‑paper workflow.
  • Rapid Feedback Loops – Automated scores become available within minutes of scanning, enabling timely student feedback and adaptive learning pathways.
  • Auditability & Transparency – Deterministic templates and JSON reports make it easy to trace each grade back to the underlying LLM reasoning, satisfying accreditation requirements.
  • Cost Reduction – With only ~17 % of answers needing manual review, institutions can cut grading labor by up to 80 % for open‑ended STEM assessments.
  • Extensibility – The same pipeline can be adapted to other domains (e.g., physics problem sets, architectural sketches) by swapping the rubric and adjusting the vision preprocessing for domain‑specific symbols.

Limitations & Future Work

  • Language & Domain Specificity – The current evaluation is on a Slovenian engineering quiz; performance on other languages or highly specialized engineering sub‑fields remains to be validated.
  • Diagram Complexity – While simple circuit schematics are handled well, more intricate drawings (e.g., multi‑layer PCB layouts) may exceed the current vision module’s capabilities.
  • Model Access – The pipeline relies on proprietary LLM APIs (GPT‑5.2, Gemini‑3 Pro); reproducibility could be limited for organizations without commercial access.
  • Human‑in‑the‑Loop Optimization – Future work could explore active learning strategies where the system selectively queries human graders to improve its prompts over time.

Bottom line: By marrying multimodal LLMs with a rigorously engineered grading workflow, this research demonstrates a viable path toward automated, trustworthy assessment of handwritten engineering exams—opening the door for broader adoption of AI‑assisted education at scale.

Authors

  • Janez Perš
  • Jon Muhovič
  • Andrej Košir
  • Boštjan Murovec

Paper Information

  • arXiv ID: 2601.00730v1
  • Categories: cs.CV
  • Published: January 2, 2026