[Paper] AI tutoring can safely and effectively support students: An exploratory RCT in UK classrooms

Published: December 29, 2025 at 12:44 PM EST
4 min read
Source: arXiv (2512.23633v1)

Overview

A recent exploratory randomized controlled trial (RCT) examined whether a generative‑AI tutor—LearnLM, fine‑tuned for pedagogy—can deliver the same (or better) learning outcomes as human tutors in UK secondary‑school math classes. By embedding the model in a chat‑based interface and letting expert tutors supervise its replies, the study suggests that AI‑driven tutoring can be both safe and effective at scale.

Key Contributions

  • Pedagogical fine‑tuning: Demonstrated a systematic approach to adapt a large language model (LLM) for classroom‑level tutoring.
  • Human‑in‑the‑loop supervision: Tutors reviewed AI‑generated messages before sending them, achieving a 76.4 % “minimal‑edit” rate.
  • Empirical performance: Students assisted by LearnLM performed on par with, and in some cases outperformed, peers receiving only human tutoring (5.5 pp higher success on novel problems).
  • Socratic questioning capability: Tutors reported that LearnLM excelled at generating probing questions that deepened student reasoning.
  • Bidirectional learning: Human tutors claimed they learned new pedagogical techniques from the model’s suggestions.
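The "minimal‑edit" approval criterion above (at most two character edits before sending, per the results table) can be sketched with a plain Levenshtein edit distance. The helper names, threshold, and example messages below are illustrative assumptions, not the paper's implementation:

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + (ca != cb)))      # substitution
        prev = curr
    return prev[-1]

def minimal_edit_rate(pairs, threshold=2):
    # Fraction of (draft, sent) pairs where the tutor changed at most
    # `threshold` characters before sending.
    kept = sum(edit_distance(draft, sent) <= threshold for draft, sent in pairs)
    return kept / len(pairs)

# Toy examples: unedited, one character changed, heavily rewritten.
pairs = [
    ("Try factoring the quadratic first.", "Try factoring the quadratic first."),
    ("What is 2x + 3?", "What is 2x + 4?"),
    ("Use the formula.", "Please use the quadratic formula instead."),
]
print(minimal_edit_rate(pairs))  # 2 of the 3 drafts fall within the threshold
```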

Methodology

  1. Participants & Setting – 165 students from five UK secondary schools were randomly assigned to either:

    • AI‑assisted tutoring (LearnLM + human supervisor)
    • Human‑only tutoring (traditional one‑to‑one chat).
  2. Technology Stack – LearnLM was built on a large‑scale transformer model, further fine‑tuned on a curated corpus of math tutoring dialogues, feedback loops, and Socratic‑question patterns.

  3. Supervision Workflow – For each student query, LearnLM drafted a response. A human tutor then either approved it (zero/minimal edits) or edited it before sending. This kept the interaction safe while allowing the AI to handle the bulk of content generation.

  4. Assessment – Learning outcomes were measured via:

    • Immediate problem‑solving accuracy on the target topic.
    • Transfer performance on novel problems from the next topic.
    • Qualitative tutor interviews about the interaction quality.
  5. Statistical Analysis – Differences in success rates were evaluated using mixed‑effects logistic regression to account for classroom clustering and individual ability variance.
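As a back‑of‑the‑envelope companion to the mixed‑effects analysis (which accounts for clustering; this sketch deliberately does not), the reported transfer rates can be converted to the log‑odds scale on which logistic regression operates:

```python
import math

p_ai, p_human = 0.662, 0.607   # reported success rates on novel problems

# Percentage-point difference (the effect reported in the results table).
pp_diff = (p_ai - p_human) * 100

# Logistic regression models effects on the log-odds scale, so the same
# difference can also be expressed as an odds ratio.
def odds(p):
    return p / (1 - p)

odds_ratio = odds(p_ai) / odds(p_human)
log_odds_diff = math.log(odds_ratio)

print(f"difference: {pp_diff:.1f} pp")
print(f"odds ratio: {odds_ratio:.3f}")
print(f"log-odds (logit coefficient): {log_odds_diff:.3f}")
```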

Results & Findings

| Metric | AI‑assisted (LearnLM) | Human‑only | Effect |
| --- | --- | --- | --- |
| Approval rate (≤2 character edits) | 76.4 % | N/A | Indicates high fidelity of AI drafts |
| Success on target problems | ≈ same as human‑only | (reference) | No degradation |
| Success on novel problems (next topic) | 66.2 % | 60.7 % | +5.5 pp (statistically significant) |
| Tutor satisfaction (qualitative) | Positive; praised Socratic prompts | N/A | Tutors felt AI contributed pedagogical value |

Key takeaways: LearnLM can reliably generate tutoring content that requires little human correction, and its Socratic style may boost students’ ability to transfer knowledge to new problems.

Practical Implications

  • Scalable tutoring services: EdTech platforms can integrate a fine‑tuned LLM as a first‑line tutor, reserving human experts for oversight or edge cases, dramatically reducing cost per student.
  • Developer‑friendly APIs: The study’s workflow can be replicated via a “draft‑then‑approve” API pattern—LLM generates a message, returns a confidence score, and a human reviewer decides to send or edit.
  • Enhanced adaptive learning: Socratic‑question generation can be exposed as a modular component, allowing developers to plug it into existing recommendation or feedback loops.
  • Teacher professional development: The bidirectional learning effect suggests AI can serve as a “coach” for teachers, surfacing effective questioning techniques that can be harvested for training programs.
  • Compliance & safety: The human‑in‑the‑loop model satisfies many regulatory concerns around AI‑generated educational content, offering a pragmatic path to deployment in K‑12 environments.
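The "draft‑then‑approve" pattern described above can be sketched as a small Python loop. The `Draft` dataclass, its confidence field, and the callback interfaces are illustrative assumptions, not any published API:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Draft:
    text: str
    confidence: float  # hypothetical score returned alongside the draft

def tutor_reply(
    student_message: str,
    generate: Callable[[str], Draft],          # LLM drafting step (assumed interface)
    review: Callable[[Draft], Optional[str]],  # human returns edited text, or None to approve
) -> str:
    """Draft-then-approve loop: the model drafts, a human decides what is sent."""
    draft = generate(student_message)
    edited = review(draft)
    return edited if edited is not None else draft.text

# Toy stand-ins for the model and the reviewer.
def fake_llm(msg: str) -> Draft:
    return Draft(text=f"What happens if you isolate x in '{msg}'?", confidence=0.9)

def approve_all(draft: Draft) -> Optional[str]:
    return None  # reviewer approves with no edits

print(tutor_reply("2x + 3 = 7", fake_llm, approve_all))
```

In production the `review` step would sit behind a tutor-facing queue, but the control flow is the same: nothing reaches the student without an explicit approve-or-edit decision.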

Limitations & Future Work

  • Sample size & diversity: The trial involved 165 students from a limited geographic region; broader studies are needed to confirm generalizability across subjects, age groups, and cultural contexts.
  • Supervision overhead: While the edit rate was low, the study did not quantify the exact time burden on tutors; future work should measure cost‑benefit trade‑offs more precisely.
  • Long‑term retention: The experiment focused on short‑term problem‑solving; longitudinal studies are required to assess knowledge retention over months or semesters.
  • Model bias & fairness: The paper notes no systematic bias, but deeper audits are necessary to ensure equitable treatment of diverse learners.
  • Automation of supervision: Exploring confidence‑threshold mechanisms or reinforcement‑learning from human feedback could further reduce the need for manual review.
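A confidence‑threshold mechanism of the kind proposed above could route drafts as follows; the threshold value and policy names are hypothetical, and in practice the threshold would be tuned against the observed minimal‑edit rate:

```python
from dataclasses import dataclass

@dataclass
class Draft:
    text: str
    confidence: float

def route(draft: Draft, threshold: float = 0.85) -> str:
    # Hypothetical policy: auto-send high-confidence drafts, queue the
    # rest for human review.
    return "auto_send" if draft.confidence >= threshold else "human_review"

print(route(Draft("Try substituting x = 2.", 0.92)))  # auto_send
print(route(Draft("The answer is 42.", 0.40)))        # human_review
```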

For developers interested in experimenting with AI‑driven tutoring, the core takeaway is that a well‑fine‑tuned LLM, coupled with a lightweight human‑in‑the‑loop workflow, can deliver pedagogically sound, scalable support—opening the door to more affordable, personalized education at scale.

Authors

  • LearnLM Team
  • Eedi
  • Albert Wang
  • Aliya Rysbek
  • Andrea Huber
  • Anjali Nambiar
  • Anna Kenolty
  • Ben Caulfield
  • Beth Lilley‑Draper
  • Bibi Groot
  • Brian Veprek
  • Chelsea Burdett
  • Claire Willis
  • Craig Barton
  • Digory Smith
  • George Mu
  • Harriet Walters
  • Irina Jurenka
  • Iris Hulls
  • James Stalley‑Moores
  • Jonathan Caton
  • Julia Wilkowski
  • Kaiz Alarakyia
  • Kevin R. McKee
  • Liam McCafferty
  • Lucy Dalton
  • Markus Kunesch
  • Pauline Malubay
  • Rachel Kidson
  • Rich Wells
  • Sam Wheeler
  • Sara Wiltberger
  • Shakir Mohamed
  • Simon Woodhead
  • Vasco Brazão

Paper Information

  • arXiv ID: 2512.23633v1
  • Categories: cs.CY, cs.AI, cs.LG
  • Published: December 29, 2025