[Paper] Autonomous LLM-generated Feedback for Student Exercises in Introductory Software Engineering Courses

Published: April 22, 2026
4 min read
Source: arXiv - 2604.20803v1

Overview

The paper presents NAILA, an autonomous feedback system that leverages large language models (LLMs) to grade and comment on student submissions in introductory software‑engineering courses. By offering 24/7, AI‑driven feedback, NAILA aims to alleviate the bottleneck caused by ever‑growing class sizes and limited teaching staff, while still keeping the evaluation aligned with instructor‑defined solutions.

Key Contributions

  • NAILA prototype: a fully automated pipeline that ingests student artefacts (code, design docs, etc.) and returns structured feedback generated by LLMs.
  • Prompt engineering framework: specialized templates that translate teacher‑authored model solutions into prompts that guide the LLM to produce consistent, pedagogically sound comments.
  • Large‑scale field study: deployment with >900 active students at the University of Duisburg‑Essen, covering adoption motives, perceived usefulness, usage patterns, and impact on grades.
  • Empirical insights: quantitative and qualitative evidence on how AI‑generated feedback compares to traditional human feedback in terms of learning outcomes and student satisfaction.

Methodology

  1. Model‑solution authoring – Instructors create a reference solution for each exercise and annotate it with grading criteria.
  2. Prompt template design – The authors craft a set of prompt templates that embed the reference solution, the rubric, and the student’s submission, instructing the LLM (e.g., GPT‑4) to produce a feedback report.
  3. System integration – NAILA is wrapped in a web interface that accepts common document formats (plain text, PDFs, Jupyter notebooks) and returns the LLM’s feedback within seconds.
  4. Empirical evaluation – Over a semester, students could optionally use NAILA. The researchers collected logs (frequency, duration), survey responses (perceived usefulness, ease of use, self‑reported learning), and academic performance data (grades on the same exercises with human vs. AI feedback). Statistical analyses (ANOVA, regression) were applied to answer the four research questions.
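Steps 1–3 above can be sketched as a simple prompt-assembly function. This is a minimal illustration, not the paper’s actual implementation: the template wording, function name, and example inputs are all assumptions.

```python
# Hypothetical sketch of NAILA-style prompt assembly: the instructor's
# reference solution and rubric are embedded alongside the student's
# submission so the LLM grades against the intended answer.

FEEDBACK_TEMPLATE = """\
You are a teaching assistant for an introductory software engineering course.
Grade the student's submission against the reference solution and rubric.
For each rubric criterion, state whether it is met and give one concrete,
constructive comment. Do not reveal the reference solution verbatim.

## Reference solution
{reference}

## Grading rubric
{rubric}

## Student submission
{submission}

## Feedback report
"""

def build_feedback_prompt(reference: str, rubric: str, submission: str) -> str:
    """Fill the template with instructor- and student-provided artefacts."""
    return FEEDBACK_TEMPLATE.format(
        reference=reference, rubric=rubric, submission=submission
    )

if __name__ == "__main__":
    prompt = build_feedback_prompt(
        reference="def add(a, b):\n    return a + b",
        rubric="- correct result\n- meaningful names",
        submission="def add(x, y):\n    return x + y",
    )
    print(prompt)
```

The assembled string would then be sent to the LLM (e.g., GPT‑4) as a single request; keeping the rubric inside the prompt is what lets the model produce criterion-by-criterion comments rather than free-form praise.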

Results & Findings

  • Motivation: Students who felt time‑pressed or wanted immediate clarification were the strongest adopters; those skeptical about AI accuracy tended to avoid NAILA.
  • User acceptance: The system scored high on perceived usefulness (average 4.2/5) and ease of use (4.0/5). Learners reported a modest but statistically significant boost in self‑assessed understanding (≈ +0.3 points on a 5‑point Likert scale).
  • Engagement patterns: On average, students accessed NAILA 2.7 times per week, with usage spikes right before assignment deadlines. Feedback latency was consistently under 30 seconds.
  • Academic impact: Students who regularly used NAILA achieved marginally higher grades (≈ 2 percentage points) compared to peers relying solely on human TA feedback. The difference persisted after controlling for prior GPA and attendance.

Practical Implications

  • Scalable tutoring: Universities can deploy NAILA‑like services to extend instructor capacity without hiring additional staff, especially for large introductory courses.
  • Continuous learning loops: Immediate AI feedback encourages iterative improvement—students can fix mistakes on the fly rather than waiting days for a TA’s comments.
  • Tool integration: Because NAILA works with open document formats, it can be embedded into existing LMSs (Moodle, Canvas) or IDE plugins, making adoption frictionless for developers and educators alike.
  • Data‑driven curriculum tweaks: Aggregated feedback logs reveal common misconceptions, enabling instructors to adjust lecture material or create targeted remedial content.
  • Cost‑effectiveness: Leveraging pay‑per‑token LLM APIs can be cheaper than scaling human grading staff, especially when combined with caching of repeated prompts for similar solutions.
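The caching idea in the last bullet can be sketched as follows. This is a hypothetical illustration (the class and helper names are invented, and a real deployment would normalize submissions and persist the cache), but it shows why identical prompts only incur one pay-per-token call.

```python
import hashlib
from typing import Callable, Dict

def _cache_key(prompt: str) -> str:
    """Stable key for a prompt; identical prompts share one LLM call."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

class CachedFeedback:
    """Wrap a pay-per-token LLM call so repeated prompts are answered
    from a local cache instead of being billed again."""

    def __init__(self, llm_call: Callable[[str], str]) -> None:
        self._llm_call = llm_call
        self._cache: Dict[str, str] = {}
        self.api_calls = 0  # how often we actually paid for tokens

    def feedback(self, prompt: str) -> str:
        key = _cache_key(prompt)
        if key not in self._cache:
            self.api_calls += 1
            self._cache[key] = self._llm_call(prompt)
        return self._cache[key]

if __name__ == "__main__":
    fake_llm = lambda p: f"feedback for {len(p)} chars"  # stand-in for an API call
    svc = CachedFeedback(fake_llm)
    svc.feedback("submission A")
    svc.feedback("submission A")  # served from cache, no second charge
    svc.feedback("submission B")
    print(svc.api_calls)  # 2: only distinct prompts hit the API
```

In a course setting, many students converge on near-identical solutions to the same exercise, so hashing a normalized form of the submission (whitespace- and identifier-insensitive) would raise the hit rate further.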

Limitations & Future Work

  • LLM reliability: The system occasionally produced overly generic or even incorrect feedback, especially for edge‑case code patterns not covered by the prompt templates.
  • Domain scope: The study focused on introductory SE topics; extending to advanced algorithms or system‑level design may require richer prompts and more domain‑specific fine‑tuning.
  • Student bias: Self‑selection (students opting into NAILA) could confound the observed grade improvements; a randomized controlled trial would strengthen causal claims.
  • Ethical considerations: Relying on AI feedback raises questions about academic honesty and over‑dependence on black‑box tools—future work should explore transparency mechanisms (e.g., showing the LLM’s reasoning trace).

Bottom line: NAILA demonstrates that LLM‑powered, on‑demand feedback can meaningfully augment traditional teaching in high‑enrollment software engineering courses, offering a practical blueprint for institutions looking to harness generative AI for scalable education.

Authors

  • Andreas Metzger

Paper Information

  • arXiv ID: 2604.20803v1
  • Categories: cs.SE
  • Published: April 22, 2026