[Paper] 'How Do I ...?': Procedural Questions Predominate Student-LLM Chatbot Conversations

Published: February 20, 2026 at 12:27 PM EST
4 min read
Source: arXiv


Overview

The paper investigates what kinds of questions students actually ask when they interact with LLM‑powered educational chatbots. By analyzing over 6,000 messages from two distinct learning settings (self‑study and assessed coursework), the authors show that procedural questions (e.g., “How do I …?”) dominate the dialogue, especially when students are preparing for high‑stakes exams. The study also evaluates whether LLMs themselves can serve as reliable raters for classifying these questions, finding that they can match or exceed human consistency.

Key Contributions

  • Empirical taxonomy of student questions in chatbot interactions across formative and summative contexts.
  • Large‑scale annotation of 6,113 messages using four established question‑type schemas, performed by three human raters and eleven different LLMs.
  • Reliability analysis demonstrating that LLMs achieve moderate‑to‑good inter‑rater agreement, often surpassing human raters.
  • Insight into question distribution: procedural (“how‑to”) questions are the most frequent, with a higher share in exam‑preparation scenarios.
  • Critical reflection on the limits of existing classification schemas for capturing the richness of multi‑turn, composite prompts.
  • Roadmap for future research, recommending conversation‑analysis techniques from discursive psychology to better model the dynamic nature of student‑LLM dialogue.

Methodology

  1. Data Collection

    • Two corpora were compiled: (a) self‑study sessions where learners used a chatbot for practice, and (b) coursework submissions where a chatbot was allowed as a study aid.
    • The combined set contains 6,113 individual student messages.
  2. Annotation Schemas

    • Four pre‑existing question‑type frameworks (e.g., procedural, conceptual, factual, metacognitive) were adopted.
    • Each message was labeled according to the schema that best described the primary intent of the question.
  3. Raters

    • Human: Three domain‑experienced annotators performed independent labeling.
    • LLM: Eleven state‑of‑the‑art models (including GPT‑4, Claude, Llama 2, etc.) were prompted to classify the same messages.
  4. Reliability Measurement

    • Inter‑rater reliability was quantified using Cohen’s κ and Krippendorff’s α.
    • Comparisons were made between human‑human, LLM‑LLM, and human‑LLM agreement levels.
  5. Statistical Analysis

    • Frequency distributions of question types were compared across the two learning contexts.
    • Significance testing (χ²) assessed whether procedural dominance differed between formative and summative settings.
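
The reliability and significance steps above can be sketched in a few lines. The label sequences and contingency counts below are hypothetical placeholders (the paper reports only aggregate statistics), but the computations — pairwise Cohen's κ and a χ² test of question-type distributions across contexts — follow the methodology as described:

```python
from sklearn.metrics import cohen_kappa_score
from scipy.stats import chi2_contingency

# Hypothetical labels from two raters over the same eight messages
# (categories: procedural, conceptual, factual, metacognitive).
rater_a = ["procedural", "procedural", "conceptual", "factual",
           "metacognitive", "procedural", "conceptual", "procedural"]
rater_b = ["procedural", "conceptual", "conceptual", "factual",
           "metacognitive", "procedural", "factual", "procedural"]

# Pairwise inter-rater agreement, corrected for chance agreement.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")

# Hypothetical question-type counts per learning context.
# Rows: self-study, summative; columns: procedural, conceptual,
# factual, metacognitive.
table = [[480, 300, 150, 70],
         [630, 200, 120, 50]]

# chi-squared test of independence: does the question-type
# distribution differ between the two contexts?
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2e}")
```

With balanced row totals as above, the test asks whether the procedural share differs between contexts; a small p-value would mirror the paper's finding that procedural dominance intensifies in summative settings.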

Results & Findings

| Metric | Human Raters | LLM Raters |
| --- | --- | --- |
| Average κ (pairwise) | 0.62 (moderate) | 0.71 (moderate‑to‑good) |
| Average α (overall) | 0.65 | 0.74 |

| Context | Procedural question share |
| --- | --- |
| Self‑study | 48 % |
| Summative | 63 % |

  • Procedural dominance: Across both contexts, “how‑to” questions were the most common, rising sharply in the exam‑prep corpus.
  • LLM reliability: The LLMs not only matched human consistency but also showed less variance across models, suggesting they can serve as scalable annotators.
  • Schema limitations: Many student prompts blended procedural, conceptual, and metacognitive elements, causing mis‑classifications under the rigid schemas.

Practical Implications

  • Chatbot design: Knowing that learners primarily ask procedural questions, developers can prioritize step‑by‑step guidance, scaffolding, and explicit workflow explanations in the bot’s response generation pipeline.
  • Adaptive tutoring: Real‑time detection of procedural queries (via LLM‑based classifiers) can trigger richer, example‑driven explanations or interactive code snippets, improving learning outcomes.
  • Assessment integrity: The surge of procedural questions in summative contexts signals a risk of over‑reliance on “quick‑fix” answers; educators might need to embed prompts that encourage deeper conceptual reasoning.
  • Scalable analytics: Organizations can deploy LLMs as cheap, consistent annotators to monitor large volumes of student‑bot interactions, flagging trends (e.g., rising procedural demand) without hiring extensive human labeling teams.
  • Curriculum feedback: Aggregated question‑type data can reveal curriculum gaps—if procedural questions dominate, it may indicate that instructional materials lack clear procedural guidance.
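
As a minimal sketch of the real-time detection idea from the adaptive-tutoring point above, a lightweight keyword pre-filter could flag likely procedural queries before invoking a full LLM classifier. The patterns and function name here are illustrative assumptions, not taken from the paper:

```python
import re

# Hypothetical regex patterns for "how-to" phrasing; a production
# system would defer ambiguous cases to an LLM-based classifier.
PROCEDURAL_PATTERNS = [
    r"\bhow (do|can|should|would) i\b",
    r"\bhow to\b",
    r"\bwhat (are the )?steps\b",
    r"\bwalk me through\b",
]

def looks_procedural(message: str) -> bool:
    """Return True if the message matches a procedural phrasing pattern."""
    text = message.lower()
    return any(re.search(p, text) for p in PROCEDURAL_PATTERNS)

queries = [
    "How do I balance this chemical equation?",
    "Why does entropy increase in a closed system?",
    "What are the steps to set up the experiment?",
]
flags = [looks_procedural(q) for q in queries]
print(flags)
```

A positive flag could route the response pipeline toward step-by-step scaffolding, while unflagged queries receive conceptual explanations — the kind of triage the paper's findings motivate.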

Limitations & Future Work

  • Schema rigidity: Existing taxonomies struggled to capture multi‑intent, composite prompts, leading to noisy classifications.
  • Contextual nuance: The study treated each message in isolation; real‑world conversations are multi‑turn, and meaning often emerges across exchanges.
  • Generalizability: Datasets were limited to two academic domains; results may differ in other subjects or with different chatbot interfaces.
  • Future direction: The authors advocate for conversation‑analysis methods from discursive psychology to model the full dialogue flow, and for developing richer, multi‑label classification frameworks that reflect the layered nature of student queries.

Authors

  • Alexandra Neagu
  • Marcus Messer
  • Peter Johnson
  • Rhodri Nelson

Paper Information

  • arXiv ID: 2602.18372v1
  • Categories: cs.HC, cs.AI
  • Published: February 20, 2026
  • PDF: Download PDF