[Paper] Personalized Worked Example Generation from Student Code Submissions using Pattern-based Knowledge Components
Source: arXiv - 2604.24758v1
Overview
A new study shows how to automatically generate personalized worked examples for programming students by mining the patterns hidden in their own code submissions. By extracting “knowledge components” (KCs) directly from student programs and feeding them into a generative AI model, the system produces explanations that target the exact misconceptions a learner is grappling with—without requiring a massive hand‑crafted library of examples.
Key Contributions
- Pattern‑based KC extraction: Introduces an AST‑driven pipeline that discovers recurring structural concepts (e.g., loop patterns, recursion templates) from a batch of student submissions.
- KC‑conditioned generation: Couples the extracted KCs with a large language model (LLM) to steer the generation of worked examples toward the learner’s specific logical errors.
- Empirical validation: Conducts a blind expert evaluation comparing vanilla LLM outputs with KC‑conditioned outputs, demonstrating measurable gains in topical focus and relevance.
- Scalable personalization framework: Provides a reusable architecture that can be plugged into existing programming tutoring platforms, reducing the manual effort needed to maintain example libraries.
Methodology
- Collect student code for a given programming exercise (e.g., implementing a binary search).
- Parse each submission into an Abstract Syntax Tree (AST). The AST makes structural elements such as loops, conditionals, and function calls explicit in a language‑agnostic form.
- Cluster recurring sub‑trees across all submissions. Each cluster represents a knowledge component (KC), such as “off‑by‑one in loop bounds” or “missing base case in recursion” (a minimal sketch of this step appears after this list).
- Annotate the problem statement with the KCs that appear most frequently in a particular student’s code.
- Prompt a generative model (e.g., GPT‑4) with a template that includes:
  - The original problem description
  - The student’s code snippet
  - The list of relevant KCs
  - A request to produce a worked example that explicitly addresses those KCs (a sketch of such a prompt appears after this list).
- Expert evaluation: Two experienced CS educators rate the generated examples on relevance, correctness, and pedagogical clarity, blind to whether the example came from the baseline or KC‑conditioned pipeline.
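As a simplified illustration of the parsing and clustering steps above, the sketch below uses Python’s standard `ast` module to turn each submission into coarse structural signatures and keeps the ones that recur across many submissions. The signature scheme, the `min_support` threshold, and all helper names are assumptions made for illustration; the paper’s actual subtree‑clustering procedure and KC labels may capture finer distinctions.

```python
# Sketch of pattern-based KC extraction from student submissions.
# The depth-2 "signature" and the support threshold are illustrative
# assumptions, not the authors' exact pipeline.
import ast
from collections import Counter, defaultdict


def subtree_signatures(source: str, depth: int = 2):
    """Yield a coarse structural signature for every node in the AST.

    A signature is the node type plus the types of its descendants up to
    `depth` levels, enough to capture recurring shapes such as a for-loop
    wrapping an if-statement.
    """
    tree = ast.parse(source)

    def label(node: ast.AST, d: int) -> str:
        if d == 0:
            return type(node).__name__
        children = ",".join(label(c, d - 1) for c in ast.iter_child_nodes(node))
        return f"{type(node).__name__}({children})"

    for node in ast.walk(tree):
        yield label(node, depth)


def cluster_kcs(submissions: dict, min_support: int = 3) -> dict:
    """Group recurring signatures across submissions into candidate KCs.

    Signatures observed in at least `min_support` different submissions are
    kept; each surviving signature plays the role of one pattern-based KC,
    mapped to the set of students whose code contains it.
    """
    support = Counter()
    owners = defaultdict(set)
    for student, code in submissions.items():
        for sig in set(subtree_signatures(code)):
            support[sig] += 1
            owners[sig].add(student)
    return {sig: owners[sig] for sig, count in support.items() if count >= min_support}


# Usage (illustrative):
# kcs = cluster_kcs(all_submissions)                 # all_submissions: {student_id: code}
# alice_kcs = [sig for sig, who in kcs.items() if "alice" in who]
```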
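The KC‑conditioned prompt in the final generation step might be assembled along these lines. This is a hedged sketch: the template wording, `build_prompt()`, and the `call_llm()` placeholder are assumptions, not the authors’ actual prompt or model interface.

```python
# Sketch of KC-conditioned prompt construction (illustrative template).
PROMPT_TEMPLATE = """You are a programming tutor.

Problem statement:
{problem}

Student submission:
{student_code}

Knowledge components observed in this submission:
{kc_list}

Write a worked example that solves the problem step by step and explicitly
addresses each knowledge component listed above."""


def build_prompt(problem: str, student_code: str, kcs: list) -> str:
    """Fill the template with the problem, the student's code, and their KCs."""
    kc_list = "\n".join(f"- {kc}" for kc in kcs)
    return PROMPT_TEMPLATE.format(problem=problem,
                                  student_code=student_code,
                                  kc_list=kc_list)


# Usage (illustrative):
# prompt = build_prompt(problem_text, submission, ["off-by-one in loop bounds"])
# worked_example = call_llm(prompt)   # call_llm stands in for any LLM API (e.g., GPT-4)
```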
Results & Findings
| Metric (1‑5 scale) | Baseline LLM | KC‑Conditioned LLM |
|---|---|---|
| Topical relevance | 3.2 | 4.1 |
| Alignment with error | 2.9 | 4.0 |
| Overall pedagogical quality | 3.5 | 4.2 |
What it means:
- Higher relevance: The KC‑conditioned examples directly tackled the specific mistake (e.g., “your loop stops one iteration early”), whereas baseline examples often drifted to generic solutions.
- Better error alignment: Reviewers noted that the KC‑steered outputs explicitly named the problematic pattern, making it easier for students to map the explanation to their own code.
- Consistent quality: No drop in correctness or readability was observed, indicating that the added conditioning does not compromise the model’s language abilities.
Practical Implications
- Reduced authoring workload: Instructors no longer need to write dozens of bespoke examples for each common mistake; the system auto‑generates them on demand.
- Real‑time feedback: Integrated into IDE plugins or online judges, the pipeline can produce a tailored worked example instantly after a student’s failed submission (a minimal hook is sketched after this list).
- Scalable tutoring platforms: MOOCs, bootcamps, and corporate training portals can personalize practice at scale, improving learner retention without hiring additional teaching assistants.
- Data‑driven curriculum design: By analyzing which KCs surface most often, educators can identify curriculum gaps and prioritize new instructional material.
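For the real‑time feedback scenario above, a hypothetical online‑judge hook could look like the following, reusing `subtree_signatures()`, `build_prompt()`, and the `call_llm()` placeholder from the earlier sketches. The handler name and its arguments are assumptions for illustration, not part of the paper’s implementation.

```python
# Hypothetical hook for an online judge or IDE plugin.
def on_failed_submission(problem: str, code: str, kc_catalog: set) -> str:
    """Return a personalized worked example for a failed submission.

    kc_catalog holds the pattern signatures mined from earlier cohorts;
    only those actually present in this submission are passed to the LLM.
    """
    observed = set(subtree_signatures(code))
    student_kcs = sorted(observed & kc_catalog)
    prompt = build_prompt(problem, code, student_kcs)
    return call_llm(prompt)  # placeholder for any LLM backend
```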
Limitations & Future Work
- Domain specificity: The current implementation focuses on relatively small, well‑structured assignments (e.g., loops, recursion). Extending to larger projects or multi‑file codebases may require more sophisticated KC hierarchies.
- Model dependence: The quality of generated examples hinges on the underlying LLM; biases or hallucinations in the model could propagate into the tutoring content.
- Evaluation scope: Expert ratings were limited to a handful of problems and reviewers. Larger‑scale user studies (e.g., A/B testing with actual learners) are needed to confirm learning gains.
- Future directions: The authors plan to (1) incorporate dynamic execution traces to enrich KC extraction, (2) explore multimodal explanations (e.g., visualizations, step‑by‑step debuggers), and (3) automate the continual update of KC libraries as new cohorts of students submit code.
Authors
- Griffin Pitts
- Muntasir Hoq
- Peter Brusilovsky
- Narges Norouzi
- Arto Hellas
- Juho Leinonen
- Bita Akram
Paper Information
- arXiv ID: 2604.24758v1
- Categories: cs.HC, cs.AI, cs.CY, cs.ET, cs.LG
- Published: April 27, 2026