[Paper] MalruleLib: Large-Scale Executable Misconception Reasoning with Step Traces for Modeling Student Thinking in Mathematics
Source: arXiv - 2601.03217v1
Overview
The paper presents MalruleLib, a new framework that turns documented math misconceptions into executable procedures (called “malrules”) and automatically generates step-by-step traces of both correct and mistaken reasoning. The result is a massive synthetic dataset for evaluating language models on a core student-modeling task: given a single erroneous solution, infer the underlying misconception and predict the student’s next answer, even when the problem is phrased differently.
Key Contributions
- Executable Misconception Library – 101 “malrules” derived from 67 learning-science and math-education sources, each encoded as a programmatic transformation of a correct solution (a minimal code sketch follows this list).
- Parameterized Problem Templates – 498 problem templates (e.g., linear equations, fractions) that can be instantiated with random numbers, yielding >1 M paired traces of correct vs. malrule‑consistent work.
- Formal Task Definition (MRA) – Malrule Reasoning Accuracy measures a model’s ability to (1) identify the right malrule from a single mistake and (2) predict the student’s next answer under cross‑template rephrasing.
- Comprehensive Empirical Study – Evaluation of nine LLMs (4 B to 120 B parameters) showing a steep drop in average accuracy from 66 % (direct problem solving) to roughly 40 % (cross-template misconception prediction).
- Open‑Source Release – The full library, generation scripts, and evaluation benchmarks are publicly released for the educational‑AI community.
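To make the “executable misconception” idea concrete, here is a minimal Python sketch. The malrule shown (move a term across the equals sign without flipping its sign) and every function name are illustrative assumptions, not MalruleLib's actual encoding or API:

```python
from fractions import Fraction

def correct_trace(a: int, b: int, c: int) -> list[str]:
    """Correct steps for a*x + b = c: subtract b, then divide by a."""
    return [
        f"{a}x + {b} = {c}",
        f"{a}x = {c - b}",            # subtract b from both sides
        f"x = {Fraction(c - b, a)}",  # divide both sides by a
    ]

def malrule_keep_sign(a: int, b: int, c: int) -> list[str]:
    """Hypothetical malrule: move b across '=' without negating it."""
    return [
        f"{a}x + {b} = {c}",
        f"{a}x = {c + b}",            # faulty step: b keeps its sign
        f"x = {Fraction(c + b, a)}",  # the rest of the procedure is correct
    ]

print(correct_trace(3, 4, 19))      # ['3x + 4 = 19', '3x = 15', 'x = 5']
print(malrule_keep_sign(3, 4, 19))  # ['3x + 4 = 19', '3x = 23', 'x = 23/3']
```

Because the faulty step is a deterministic rewrite of the correct one, the same malrule can be replayed on any instance of a template, which is what makes the errors reproducible and composable.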
Methodology
- Knowledge Curation – The authors mined 67 textbooks, research papers, and curriculum guides to extract common algebraic misconceptions (e.g., “multiply both sides by the denominator” when solving equations with fractions).
- Malrule Encoding – Each misconception is expressed as a deterministic program that takes a correct solution trace and rewrites it into a malrule‑consistent trace. This makes the error reproducible and composable.
- Template Parameterization – A set of 498 problem schemas (e.g., “Solve for x: a·x + b = c”) is defined with placeholders for numeric coefficients. Random sampling fills these placeholders, producing millions of unique instances.
- Dual-Path Trace Generation – For every instantiated problem, the system generates two parallel step-by-step solutions: (a) the mathematically correct reasoning chain, and (b) the chain that follows a chosen malrule (the first sketch after this list pairs this with template parameterization).
- Evaluation Protocol (MRA) – Models receive a single erroneous step trace and must (i) classify which malrule generated it, and (ii) output the student’s next step for a rephrased version of the same problem (different template, same underlying structure); the second sketch after this list shows the scoring loop.
- Baseline Models – Nine transformer‑based LLMs ranging from 4 B to 120 B parameters are fine‑tuned on the generated data and tested on held‑out sets.
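As a rough illustration of the template-parameterization and dual-path steps above, the sketch below pairs a toy template with the hypothetical malrule from the earlier example. The `Template` class and its field names are assumptions, not the paper's schema format:

```python
import random
from dataclasses import dataclass
from fractions import Fraction
from typing import Callable

@dataclass
class Template:
    name: str
    render: Callable[..., str]   # problem text from sampled coefficients
    solve: Callable[..., list]   # correct step-by-step reasoning chain
    sample: Callable[[], dict]   # draws random coefficients

linear = Template(
    name="linear_eq",
    render=lambda a, b, c: f"Solve for x: {a}x + {b} = {c}",
    solve=lambda a, b, c: [f"{a}x = {c - b}", f"x = {Fraction(c - b, a)}"],
    sample=lambda: {"a": random.randint(2, 9),
                    "b": random.randint(1, 9),
                    "c": random.randint(10, 30)},
)

def keep_sign_malrule(a, b, c):
    """Hypothetical misconception: move b across '=' without negating it."""
    return [f"{a}x = {c + b}", f"x = {Fraction(c + b, a)}"]

def generate_pair(template, malrule):
    """Instantiate one problem and emit correct vs. malrule-consistent traces."""
    params = template.sample()
    return {
        "problem": template.render(**params),
        "correct": template.solve(**params),
        "malrule": malrule(**params),
    }

random.seed(0)
print(generate_pair(linear, keep_sign_malrule))
```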
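The scoring loop for MRA might then look like the following; the model interface (`classify_malrule`, `predict_next_step`) is a placeholder rather than the paper's evaluation harness:

```python
def mra_score(examples, model):
    """examples: dicts holding an erroneous trace, the gold malrule id, a
    rephrased problem, and the gold next step under that misconception."""
    id_hits = next_hits = 0
    for ex in examples:
        pred_rule = model.classify_malrule(ex["erroneous_trace"])
        pred_step = model.predict_next_step(ex["rephrased_problem"], pred_rule)
        id_hits += pred_rule == ex["gold_malrule"]
        next_hits += pred_step == ex["gold_next_step"]
    n = len(examples)
    return {"malrule_id_acc": id_hits / n, "next_step_acc": next_hits / n}
```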
Results & Findings
| Model Size | Direct Problem‑Solving Accuracy | Cross‑Template MRA Accuracy |
|---|---|---|
| 4 B | 61 % | 35 % |
| 13 B | 68 % | 42 % |
| 30 B | 70 % | 44 % |
| 120 B | 73 % | 48 % |
- The gap between direct problem solving and cross-template MRA is roughly 25 percentage points at every model size, indicating that current LLMs struggle to abstract the procedure behind a mistake.
- Providing the full step trace (instead of just the final answer) improves MRA by 3–15 %, confirming that intermediate reasoning is a valuable signal.
- The synthetic library enables controlled experiments: swapping one malrule for another changes performance predictably, demonstrating that the benchmark isolates misconception reasoning rather than surface lexical cues (the sketch below outlines such a swap).
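A hedged sketch of what such a controlled swap could look like: hold the problem templates fixed, regenerate the benchmark under each malrule in turn, and compare per-malrule identification accuracy. Both `build_examples` and `classify` are placeholders, not part of the released benchmark:

```python
def swap_experiment(malrule_ids, build_examples, classify):
    """Per-malrule identification accuracy with problem templates held fixed."""
    results = {}
    for rule_id in malrule_ids:
        examples = build_examples(rule_id)  # same templates, different malrule
        hits = sum(classify(ex["erroneous_trace"]) == rule_id
                   for ex in examples)
        results[rule_id] = hits / len(examples)
    return results
```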
Practical Implications
- Intelligent Tutoring Systems (ITS) can plug MalruleLib into their inference engines to diagnose a student’s misconception from a single error, then generate targeted hints that address the underlying faulty procedure.
- Developer Toolkits – The library’s API lets developers generate custom problem sets with specific misconceptions, useful for training or evaluating domain‑specific LLMs (e.g., code‑assistants that need to understand user errors).
- Curriculum Analytics – Education platforms can aggregate inferred malrule distributions across a cohort to spot systemic gaps (e.g., “most students misuse distributive property in quadratic expansions”).
- Feedback Loop for Model Fine-Tuning – By augmenting existing math-QA datasets with malrule-consistent traces, developers can teach models to anticipate student mistakes, leading to more robust answer-checking and auto-grading pipelines (a minimal augmentation sketch follows this list).
- Cross‑Domain Transfer – Because malrules are executable, the same approach could be adapted to other STEM domains (physics problem solving, programming debugging), accelerating the creation of misconception‑aware AI assistants.
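As one concrete reading of the fine-tuning bullet above, a sketch of malrule-based data augmentation; the record fields and `make_malrule_trace` are hypothetical stand-ins for whatever trace generator the released library exposes:

```python
import json

def augment(qa_items, malrule_ids, make_malrule_trace):
    """Pair each QA item with malrule-consistent traces so a fine-tuned model
    learns to anticipate the corresponding student mistakes."""
    for item in qa_items:
        for rule_id in malrule_ids:
            yield {
                "problem": item["problem"],
                "correct_trace": item["steps"],
                "malrule_id": rule_id,
                "malrule_trace": make_malrule_trace(item, rule_id),
            }

def write_jsonl(records, path="malrule_augmented.jsonl"):
    """Serialize augmented records for a standard fine-tuning pipeline."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```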
Limitations & Future Work
- Synthetic vs. Real Data – While the library covers many textbook misconceptions, real classroom data may contain hybrid or undocumented errors that are not captured.
- Scalability of Malrule Curation – Extending beyond algebra to higher‑level topics (calculus, statistics) will require additional domain expertise and manual encoding.
- Model Generalization – Even the largest 120 B model still falls short of human‑level MRA, suggesting that architectural or training‑objective changes (e.g., explicit procedural reasoning modules) are needed.
- User Interaction Studies – The paper does not evaluate how actual learners respond to malrule‑driven feedback; future work should conduct A/B tests in live tutoring environments.
MalruleLib opens the door to AI that not only solves math problems but also understands the systematic ways students get them wrong. For developers building the next generation of educational tools, it offers a ready‑made, scalable substrate for training, evaluating, and deploying misconception‑aware language models.
Authors
- Xinghe Chen
- Naiming Liu
- Shashank Sonkar
Paper Information
- arXiv ID: 2601.03217v1
- Categories: cs.CL
- Published: January 6, 2026