Understanding AI and learning outcomes
Introduction
Education is one of AI’s most promising frontiers. With tools like ChatGPT, personalized learning support can be available to any student, anywhere, at any time.
But the education sector is still early in its understanding of the impact of AI on learning outcomes. Last year, our team set out to study the use of tools like Study Mode and found promising gains in student performance. Our research also raised an important question: how can we assess how AI influences a learner’s progress over time, not just on a final exam?
This is a broader ecosystem challenge. To date, most research methods focus on narrow performance signals—such as test scores—and lack the ability to assess how students actually learn with AI in real‑world settings, and how that use shapes outcomes over time.
To address this gap, we developed the Learning Outcomes Measurement Suite, a framework created with Estonia’s University of Tartu and the SCALE Initiative at the Stanford Accelerator for Learning to support longitudinal measurement of learning outcomes across different educational contexts.
Extensive validation is underway through a randomized controlled trial, and further research is planned with founding organizations in the Learning Lab, OpenAI’s learning research ecosystem, including researchers from Arizona State University, UCL Knowledge Lab, and MIT Media Lab (building on prior collaborative studies).
Today, we’re sharing an overview of how the measurement suite works and why it matters. Over time, we intend to publish more research and release the suite as a public resource for schools, universities, and education systems worldwide.
“This research allows us to learn quickly while also laying the groundwork for a deeper understanding of how AI can be thoughtfully integrated into schools in ways that truly matter. We want to understand how these tools can support rigorous academic learning while also cultivating higher‑order thinking, creativity, curiosity, and students’ confidence in themselves as learners.”
– Susanna Loeb, Professor of Education and Faculty Director, SCALE Initiative at Stanford University
Summary of takeaways
- Today’s research methods on the impact of AI on learning show promising signals about performance, but they don’t capture the full picture of how AI affects learning outcomes over time.
- The Learning Outcomes Measurement Suite will, for the first time, provide a standard framework for longitudinal studies that help educators, researchers, and institutions understand how AI shapes learning and outcomes across different contexts.
- OpenAI’s Learning Lab is a new research ecosystem focused on advancing this work. OpenAI will publish findings alongside a range of partners as the field continues to develop.
Origins and early research
When students use AI tools to study and learn, it can mean many different things—from seeking quick answers to working through problems step‑by‑step with tutor‑like guidance. To encourage users to engage with ChatGPT in ways that support deeper understanding and skill‑building, OpenAI introduced Study Mode last year. Under the hood, Study Mode is powered by custom system instructions we wrote in collaboration with teachers, scientists, and pedagogy experts to reflect a core set of behaviors that support true learning—not just answers—using scaffolding, checks for understanding, and guided practice.
To test whether this pedagogically aligned AI interaction style translates into better learning outcomes, we ran a randomized study with over 300 college students preparing for neuroscience and microeconomics exams. While analysis is still underway, early results give us confidence that a pedagogically aligned AI interaction style, encouraged through features like Study Mode, can improve learning outcomes. This research also surfaced an important reality: what really matters is whether the gains and associated productive behaviors remain durable over time.
Study design
Participants were assigned to one of three groups:
- Control – studied using traditional online resources (Google Search, YouTube) with AI‑generated overview features disabled.
- Study Mode Variant A – accessed a version of Study Mode designed to guide students through the learning process.
- Study Mode Variant B – accessed a slightly different Study Mode variant.
Baseline quizzes and onboarding surveys were collected ahead of time to adjust for differences in prior coursework exposure, study habits, academic confidence, and familiarity with AI tools. Students completed timed Study Mode sessions before each exam, with the two variants counterbalanced across subjects.
The setup was designed to reflect real‑world study conditions rather than a tightly controlled lab environment. Participation was not tied to exam performance, and not all students used Study Mode to the same extent during the nominal 40‑minute sessions. This allowed us to measure and report intention‑to‑treat (ITT) effects—the impact of being provided access to the tool under realistic rollout conditions—acknowledging that engagement can vary in practice.
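To make the ITT framing concrete, below is a minimal sketch on simulated data. The column names, effect size, and noise level are invented for illustration and are not values from the study.

```python
import numpy as np

# Minimal intention-to-treat (ITT) illustration on simulated data.
# All numbers here are hypothetical, not results from the Study Mode trial.
rng = np.random.default_rng(42)
n = 300
assigned = rng.integers(0, 2, size=n)   # 1 = offered Study Mode, 0 = control
baseline = rng.normal(60, 10, size=n)   # baseline quiz score (covariate)

# Simulated exam score with a made-up 5-point effect of *assignment*.
# ITT keeps everyone in their assigned group regardless of actual usage.
exam = 0.6 * baseline + 5.0 * assigned + rng.normal(0, 8, size=n)

# Covariate-adjusted ITT estimate via ordinary least squares:
# exam ~ intercept + assigned + baseline
X = np.column_stack([np.ones(n), assigned, baseline])
beta, *_ = np.linalg.lstsq(X, exam, rcond=None)
print(f"Estimated ITT effect: {beta[1]:.2f} exam points")
```

Because the estimate conditions only on assignment, it answers the policy-relevant question (what happens when a tool is offered) rather than the per-user question (what happens when it is heavily used).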
Findings
Performance was measured on each exam separately. Improvements were not uniform across subjects, and engagement with Study Mode varied across participants.
- Neuroscience (primary ITT) – We observed directionally positive differences for Study Mode relative to control, but the scores were not statistically distinguishable from those of students studying with traditional online resources. Some onboarding and technical issues affected time spent studying among students using Study Mode.
- Microeconomics (primary ITT) – We observed meaningful gains in exam performance for participants with access to Study Mode compared with the control group.
Full statistical results will be shared in a forthcoming peer‑reviewed publication.
[Figure] Adjusted mean exam scores, Study Mode (variants A & B) vs. control (no‑AI group): students assigned access to Study Mode scored roughly 15% higher relative to the control group.
The effect remains consistent when we compare each Study Mode variant separately with the control.
While this variation in engagement and outcomes reflects real‑world conditions, it highlights a deeper limitation in how learning outcomes are typically measured.
Most existing evaluation approaches rely on fixed interventions assessed over short time windows, using outcomes such as test scores or final essays as primary signals. These methods are not designed to capture the core mechanism through which AI affects learning in practice: ongoing, personalized interactions that evolve alongside a learner’s own strategies, preferences, and study habits. Nor do they surface whether improvements in one capability (e.g., short‑term recall) may come alongside trade‑offs in others (e.g., persistence, autonomous motivation, or creative problem‑solving). As a result, they miss the longitudinal cognitive effects that ultimately determine whether AI meaningfully improves learning.
Because learning environments differ widely across countries, curricula, and institutional goals, outcomes from one‑off studies rarely generalize across systems. Measurement approaches must therefore be flexible enough for different education systems to:
- Define what success looks like in their context
- Evaluate AI against their own standards
- Iterate accordingly
Building a better measurement system
Based on the learnings from OpenAI’s Study Mode research, we have been building a structured measurement system to assess AI’s impact on learners at scale and to create a mechanism for improving models based on those outcomes. It is grounded in three signals—how the model behaves, how learners respond, and what measurable cognitive outcomes result over time. The system includes:
| Component | Description |
|---|---|
| System instructions to refine model behavior | Use natural language to change the default behavior of the model so it aligns with specific pedagogical approaches. |
| Learning interaction classifiers | Automatically detect “learning moments” within real, de‑identified learner–model interactions and label salient characteristics such as engagement and error correction. |
| Learning quality graders | Evaluate and score each learning moment by whether the learner achieved their objective and the degree to which the interaction followed strong pedagogical principles, including identification of failure modes. |
| Longitudinal learning graders | Track changes in the same learner’s interactions with the model over time—including engagement, persistence, and metacognitive strategies—at the individual and cohort levels. |
| Standardized cognitive and metacognitive measures | Validated third‑party instruments delivered via ChatGPT before, during, and after access to establish baselines and measure changes in foundational capabilities such as critical thinking, creativity, and memory. |
When combined, we refer to this measurement system as the Learning Outcomes Measurement Suite.
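As an illustration of the first component, pedagogically oriented system instructions can be supplied as a system message through OpenAI's standard chat API. The instruction text and model name below are placeholders of our own; the actual Study Mode prompt is not public.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder pedagogical instruction; not the real Study Mode prompt.
PEDAGOGY_INSTRUCTIONS = (
    "You are a patient tutor. Do not give final answers outright. "
    "Scaffold each problem into steps, check for understanding after "
    "each step, and offer guided practice before moving on."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": PEDAGOGY_INSTRUCTIONS},
        {"role": "user", "content": "Why do neurons have a resting potential?"},
    ],
)
print(response.choices[0].message.content)
```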
What the suite produces
- Structured views of learning moments
- Dashboards showing how outcomes shift over time across cohorts
- Indicators of model performance against teaching and tutoring rubrics
- Outcome measures aligned to standardized assessments and short learner questionnaires
Where available, it can incorporate partner‑provided ground truth such as exam scores, classroom observations, or attendance. All data are de‑identified.
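To give a sense of what a structured view of a learning moment could look like, here is a hypothetical record schema; every field name is our own illustration rather than the suite's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class LearningMoment:
    """One de-identified learning moment, as the interaction classifiers
    might emit it. All fields are hypothetical illustrations; none are
    confirmed elements of the Learning Outcomes Measurement Suite."""
    learner_id: str                  # pseudonymous, de-identified ID
    session_id: str
    topic: str                       # e.g., "microeconomics: elasticity"
    objective_achieved: bool         # did the learner reach their stated goal?
    pedagogy_score: float            # 0-1 grade against tutoring rubrics
    engagement_labels: list[str] = field(default_factory=list)

moment = LearningMoment(
    learner_id="anon-1042",
    session_id="s-0007",
    topic="neuroscience: action potentials",
    objective_achieved=True,
    pedagogy_score=0.82,
    engagement_labels=["check_for_understanding", "error_correction"],
)
```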
Deeper cognitive impacts tracked
- Autonomous Motivation – the degree to which learners shape their own studies vs. being directed by the model
- Productive Engagement – the frequency, variety, and quality of pedagogical interactions
- Task Persistence – the degree to which a learner sits with and pushes through cognitive challenges
- Metacognition – the frequency and quality of learners’ efforts to plan, reflect, and monitor their studying approaches
- Recall – the accuracy with which a learner can remember content from previous interactions
This reflects our overall effort to move beyond narrow definitions of learning outcomes (e.g., rising test scores) toward the holistic capabilities that underpin learning. We also recognize that there is no silver bullet; systems and educators will need to be empowered to guide trade‑offs in alignment with pedagogical best practices.
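As a toy example of the kind of signal a longitudinal grader might aggregate, the sketch below fits a simple trend to one learner's session-level task-persistence scores; the scores and scale are invented for illustration.

```python
import numpy as np

def persistence_trend(session_scores: list[float]) -> float:
    """Slope of a least-squares line through one learner's per-session
    task-persistence scores (0-1 scale); a positive slope suggests the
    learner is pushing through challenges more over time. The scores are
    hypothetical grader outputs, not real suite data."""
    t = np.arange(len(session_scores))
    slope, _intercept = np.polyfit(t, session_scores, deg=1)
    return float(slope)

# Invented example: one learner across six study sessions.
scores = [0.35, 0.40, 0.38, 0.50, 0.55, 0.62]
print(f"Persistence trend: {persistence_trend(scores):+.3f} per session")
```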
Where we go from here
We are validating the Learning Outcomes Measurement Suite through large‑scale studies before making it broadly available. This work is underway with the University of Tartu and Stanford’s SCALE Initiative in nation‑scale settings such as Estonia, where the suite is being studied with nearly 20,000 students aged 16‑18 over several months. Studies are being run in close collaboration with local leaders to ensure safety and alignment with local curricula.
“Estonia has always approached education not as static but as a system we continuously improve. With AI becoming part of that picture, the big question is how we measure AI’s long‑term impact on learning. That’s what we’re figuring out in collaboration with OpenAI. Students are keen to be involved in the development process, and many want to learn how to support learning with AI. It feels like a real turning point, and we’re excited to contribute methods that other education systems can reuse and build on.”
— Jaan Aru, University of Tartu
This work builds on a broader body of collaborative research underway. In addition to the outcomes research being conducted through founding partners in the Learning Lab, OpenAI is supporting studies at the intersection of learning and labor—examining how AI shapes students’ academic pathways, career decisions, and the ways institutions can support responsible adoption. This research is happening across:
- Bocconi University
- Innova Schools
- Tuck School of Business at Dartmouth
- San Diego State University
- Stony Brook University
- …and others
As we run longer‑term studies on how students learn best with AI, we intend to share findings and work with the broader education ecosystem to ensure AI benefits learners everywhere.
Those interested in receiving updates on this work can sign up here.