[Paper] AI tutoring can safely and effectively support students: An exploratory RCT in UK classrooms
Source: arXiv - 2512.23633v1
Overview
A recent exploratory randomized controlled trial (RCT) examined whether a generative‑AI tutor (LearnLM, fine‑tuned for pedagogy) can match or exceed the learning outcomes of human‑only tutoring in UK secondary‑school math classes. By embedding the model in a chat‑based interface and having expert tutors supervise its replies, the study suggests that AI‑driven tutoring can be delivered safely and effectively in real classrooms.
Key Contributions
- Pedagogical fine‑tuning: Demonstrated a systematic approach to adapt a large language model (LLM) for classroom‑level tutoring.
- Human‑in‑the‑loop supervision: Tutors reviewed AI‑generated messages before sending them, achieving a 76.4 % “minimal‑edit” rate.
- Empirical performance: Students assisted by LearnLM performed on par with, and in some cases outperformed, peers receiving only human tutoring (5.5 pp higher success on novel problems).
- Socratic questioning capability: Tutors reported that LearnLM excelled at generating probing questions that deepened student reasoning.
- Bidirectional learning: Human tutors reported learning new pedagogical techniques from the model’s suggestions.
Methodology
- Participants & Setting – 165 students from five UK secondary schools were randomly assigned to either:
- AI‑assisted tutoring (LearnLM + human supervisor)
- Human‑only tutoring (traditional one‑to‑one chat).
- Technology Stack – LearnLM was built on a large‑scale transformer model, further fine‑tuned on a curated corpus of math tutoring dialogues, feedback loops, and Socratic‑question patterns (an illustrative training record follows this list).
- Supervision Workflow – For each student query, LearnLM drafted a response. A human tutor then either approved it (zero/minimal edits) or edited it before sending. This kept the interaction safe while allowing the AI to handle the bulk of content generation.
- Assessment – Learning outcomes were measured via:
- Immediate problem‑solving accuracy on the target topic.
- Transfer performance on novel problems from the next topic.
- Qualitative tutor interviews about the interaction quality.
- Statistical Analysis – Differences in success rates were evaluated using mixed‑effects logistic regression to account for classroom clustering and individual ability variance (a minimal model‑fitting sketch follows this list).
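The paper does not describe the format of the tutoring‑dialogue corpus used for fine‑tuning. Purely as an illustration, a single supervised training record might look like the sketch below; every field name here is a hypothetical choice, not the actual LearnLM data schema.

```python
# Hypothetical example of one supervised fine-tuning record for a math-tutoring
# dialogue. Field names and structure are illustrative assumptions, not the
# format used by the LearnLM team.
import json

record = {
    "subject": "mathematics",
    "topic": "linear equations",
    "dialogue": [
        {"role": "student", "text": "I got x = 3 for 2x + 4 = 12, is that right?"},
        # Target behaviour: a Socratic probe rather than a direct correction.
        {"role": "tutor", "text": "Close! If x were 3, what would 2x + 4 equal? Try substituting it back in."},
    ],
    "labels": {"strategy": "socratic_question", "reveals_answer": False},
}

# Records like this could be serialised to JSONL for supervised fine-tuning.
print(json.dumps(record, indent=2))
```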
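The paper reports mixed‑effects logistic regression but does not state which software or exact specification was used. The sketch below fits one plausible variant, a random‑intercept logistic model, with statsmodels on synthetic stand‑in data; all column names and values are assumptions.

```python
# Minimal sketch of a random-intercept logistic regression on synthetic data.
# Column names (solved, condition, school) are illustrative assumptions.
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "condition": rng.integers(0, 2, n),   # 1 = AI-assisted, 0 = human-only
    "school": rng.integers(0, 5, n),      # five schools, as in the trial
})
df["solved"] = rng.binomial(1, 0.55 + 0.05 * df["condition"])

# Fixed effect of condition, with a random intercept per school.
model = BinomialBayesMixedGLM.from_formula(
    "solved ~ condition",
    vc_formulas={"school": "0 + C(school)"},
    data=df,
)
result = model.fit_vb()  # variational Bayes approximation
print(result.summary())
```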
Results & Findings
| Metric | AI‑assisted (LearnLM) | Human‑only | Effect |
|---|---|---|---|
| Approval rate (≤2 character edits) | 76.4 % | N/A | Indicates high fidelity of AI drafts |
| Success on target problems | ≈ same as human | — | No degradation |
| Success on novel problems (next topic) | 66.2 % | 60.7 % | +5.5 pp (statistically significant) |
| Tutor satisfaction (qualitative) | Positive – praised Socratic prompts | — | Tutors felt AI contributed pedagogical value |
Key takeaways: LearnLM can reliably generate tutoring content that requires little human correction, and its Socratic style may boost students’ ability to transfer knowledge to new problems.
Practical Implications
- Scalable tutoring services: EdTech platforms can integrate a fine‑tuned LLM as a first‑line tutor, reserving human experts for oversight or edge cases, potentially reducing cost per student.
- Developer‑friendly APIs: The study’s workflow can be replicated via a “draft‑then‑approve” API pattern: the LLM generates a message, returns a confidence score, and a human reviewer decides whether to send or edit (see the sketch after this list).
- Enhanced adaptive learning: Socratic‑question generation can be exposed as a modular component, allowing developers to plug it into existing recommendation or feedback loops.
- Teacher professional development: The bidirectional learning effect suggests AI can serve as a “coach” for teachers, surfacing effective questioning techniques that can be harvested for training programs.
- Compliance & safety: The human‑in‑the‑loop model satisfies many regulatory concerns around AI‑generated educational content, offering a pragmatic path to deployment in K‑12 environments.
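As a minimal sketch of that draft‑then‑approve pattern (the model call, confidence field, and helper names below are assumptions for illustration, not the paper's implementation):

```python
# Minimal sketch of a "draft-then-approve" tutoring loop. generate_draft stands
# in for any LLM call; the confidence score is an illustrative assumption.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Draft:
    text: str
    confidence: float  # heuristic or model-derived score shown to the reviewer

def generate_draft(student_message: str) -> Draft:
    """Placeholder for a call to a pedagogically fine-tuned model."""
    return Draft(
        text="What happens if you substitute x = 3 back into 2x + 4?",
        confidence=0.92,
    )

def respond(student_message: str, review: Callable[[Draft], str]) -> str:
    """Draft with the model, then let a human tutor approve or edit before sending."""
    draft = generate_draft(student_message)
    return review(draft)

# In the trial every draft was reviewed; here the "tutor" simply approves it.
approved = respond(
    "I got x = 3 for 2x + 4 = 12, is that right?",
    review=lambda draft: draft.text,
)
print(approved)
```

A natural extension, flagged under future work below, would be to auto‑send drafts whose confidence exceeds a threshold and queue only the rest for human review; the trial itself kept a tutor in the loop for every message.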
Limitations & Future Work
- Sample size & diversity: The trial involved 165 students from a limited geographic region; broader studies are needed to confirm generalizability across subjects, age groups, and cultural contexts.
- Supervision overhead: While the edit rate was low, the study did not quantify the exact time burden on tutors; future work should measure cost‑benefit trade‑offs more precisely.
- Long‑term retention: The experiment focused on short‑term problem‑solving; longitudinal studies are required to assess knowledge retention over months or semesters.
- Model bias & fairness: The paper notes no systematic bias, but deeper audits are necessary to ensure equitable treatment of diverse learners.
- Automation of supervision: Exploring confidence‑threshold mechanisms or reinforcement‑learning from human feedback could further reduce the need for manual review.
For developers interested in experimenting with AI‑driven tutoring, the core takeaway is that a well‑fine‑tuned LLM, coupled with a lightweight human‑in‑the‑loop workflow, can deliver pedagogically sound, scalable support—opening the door to more affordable, personalized education at scale.
Authors
- LearnLM Team
- Eedi
- Albert Wang
- Aliya Rysbek
- Andrea Huber
- Anjali Nambiar
- Anna Kenolty
- Ben Caulfield
- Beth Lilley‑Draper
- Bibi Groot
- Brian Veprek
- Chelsea Burdett
- Claire Willis
- Craig Barton
- Digory Smith
- George Mu
- Harriet Walters
- Irina Jurenka
- Iris Hulls
- James Stalley‑Moores
- Jonathan Caton
- Julia Wilkowski
- Kaiz Alarakyia
- Kevin R. McKee
- Liam McCafferty
- Lucy Dalton
- Markus Kunesch
- Pauline Malubay
- Rachel Kidson
- Rich Wells
- Sam Wheeler
- Sara Wiltberger
- Shakir Mohamed
- Simon Woodhead
- Vasco Brazão
Paper Information
- arXiv ID: 2512.23633v1
- Categories: cs.CY, cs.AI, cs.LG
- Published: December 29, 2025
- PDF: https://arxiv.org/pdf/2512.23633v1