[Paper] 'How Do I ...?': Procedural Questions Predominate Student-LLM Chatbot Conversations

Published: February 20, 2026 at 12:27 PM EST
4 min read
Source: arXiv


Overview

The paper investigates what kinds of questions students actually ask when they interact with LLM‑powered educational chatbots. By analyzing over 6,000 messages from two distinct learning settings (self‑study and assessed coursework), the authors show that procedural questions (e.g., “How do I …?”) dominate the dialogue, especially when students are preparing for high‑stakes exams. The study also evaluates whether LLMs themselves can serve as reliable raters for classifying these questions, finding that they can match or exceed human consistency.

Key Contributions

  • Empirical taxonomy of student questions in chatbot interactions across formative and summative contexts.
  • Large‑scale annotation of 6,113 messages using four established question‑type schemas, performed by three human raters and eleven different LLMs.
  • Reliability analysis demonstrating that LLMs achieve moderate‑to‑good inter‑rater agreement, often surpassing human raters.
  • Insight into question distribution: procedural (“how‑to”) questions are the most frequent, with a higher share in exam‑preparation scenarios.
  • Critical reflection on the limits of existing classification schemas for capturing the richness of multi‑turn, composite prompts.
  • Roadmap for future research, recommending conversation‑analysis techniques from discursive psychology to better model the dynamic nature of student‑LLM dialogue.

Methodology

  1. Data Collection

    • Two corpora were compiled: (a) self‑study sessions where learners used a chatbot for practice, and (b) coursework submissions where a chatbot was allowed as a study aid.
    • The combined set contains 6,113 individual student messages.
  2. Annotation Schemas

    • Four pre‑existing question‑type frameworks (e.g., procedural, conceptual, factual, metacognitive) were adopted.
    • Each message was labeled according to the schema that best described the primary intent of the question.
  3. Raters

    • Human: Three domain‑experienced annotators performed independent labeling.
    • LLM: Eleven state‑of‑the‑art models (including GPT‑4, Claude, Llama 2, etc.) were prompted to classify the same messages.
  4. Reliability Measurement

    • Inter‑rater reliability was quantified using Cohen’s κ and Krippendorff’s α.
    • Comparisons were made between human‑human, LLM‑LLM, and human‑LLM agreement levels.
  5. Statistical Analysis

    • Frequency distributions of question types were compared across the two learning contexts.
    • Significance testing (χ²) assessed whether procedural dominance differed between formative and summative settings.
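
The reliability and significance steps above can be sketched in a few lines. The label sequences and contingency counts below are hypothetical placeholders (the paper reports only aggregate statistics), but the computations — pairwise Cohen's κ and a χ² test of question-type distributions across contexts — follow the methodology as described:

```python
from sklearn.metrics import cohen_kappa_score
from scipy.stats import chi2_contingency

# Hypothetical labels from two raters over the same eight messages
# (categories: procedural, conceptual, factual, metacognitive).
rater_a = ["procedural", "procedural", "conceptual", "factual",
           "metacognitive", "procedural", "conceptual", "procedural"]
rater_b = ["procedural", "conceptual", "conceptual", "factual",
           "metacognitive", "procedural", "factual", "procedural"]

# Pairwise inter-rater agreement, corrected for chance agreement.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")

# Hypothetical question-type counts per learning context.
# Rows: self-study, summative; columns: procedural, conceptual,
# factual, metacognitive.
table = [[480, 300, 150, 70],
         [630, 200, 120, 50]]

# chi-squared test of independence: does the question-type
# distribution differ between the two contexts?
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2e}")
```

With balanced row totals as above, the test asks whether the procedural share differs between contexts; a small p-value would mirror the paper's finding that procedural dominance intensifies in summative settings.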

Results & Findings

| Metric | Human Raters | LLM Raters |
| --- | --- | --- |
| Average κ (pairwise) | 0.62 (moderate) | 0.71 (moderate‑to‑good) |
| Average α (overall) | 0.65 | 0.74 |

| Context | Procedural question share |
| --- | --- |
| Self‑study | 48 % |
| Summative | 63 % |

  • Procedural dominance: Across both contexts, “how‑to” questions were the most common, rising sharply in the exam‑prep corpus.
  • LLM reliability: The LLMs not only matched human consistency but also showed less variance across models, suggesting they can serve as scalable annotators.
  • Schema limitations: Many student prompts blended procedural, conceptual, and metacognitive elements, causing mis‑classifications under the rigid schemas.

Practical Implications

  • Chatbot design: Knowing that learners primarily ask procedural questions, developers can prioritize step‑by‑step guidance, scaffolding, and explicit workflow explanations in the bot’s response generation pipeline.
  • Adaptive tutoring: Real‑time detection of procedural queries (via LLM‑based classifiers) can trigger richer, example‑driven explanations or interactive code snippets, improving learning outcomes.
  • Assessment integrity: The surge of procedural questions in summative contexts signals a risk of over‑reliance on “quick‑fix” answers; educators might need to embed prompts that encourage deeper conceptual reasoning.
  • Scalable analytics: Organizations can deploy LLMs as cheap, consistent annotators to monitor large volumes of student‑bot interactions, flagging trends (e.g., rising procedural demand) without hiring extensive human labeling teams.
  • Curriculum feedback: Aggregated question‑type data can reveal curriculum gaps—if procedural questions dominate, it may indicate that instructional materials lack clear procedural guidance.
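
As a minimal sketch of the real-time detection idea from the adaptive-tutoring point above, a lightweight keyword pre-filter could flag likely procedural queries before invoking a full LLM classifier. The patterns and function name here are illustrative assumptions, not taken from the paper:

```python
import re

# Hypothetical regex patterns for "how-to" phrasing; a production
# system would defer ambiguous cases to an LLM-based classifier.
PROCEDURAL_PATTERNS = [
    r"\bhow (do|can|should|would) i\b",
    r"\bhow to\b",
    r"\bwhat (are the )?steps\b",
    r"\bwalk me through\b",
]

def looks_procedural(message: str) -> bool:
    """Return True if the message matches a procedural phrasing pattern."""
    text = message.lower()
    return any(re.search(p, text) for p in PROCEDURAL_PATTERNS)

queries = [
    "How do I balance this chemical equation?",
    "Why does entropy increase in a closed system?",
    "What are the steps to set up the experiment?",
]
flags = [looks_procedural(q) for q in queries]
print(flags)
```

A positive flag could route the response pipeline toward step-by-step scaffolding, while unflagged queries receive conceptual explanations — the kind of triage the paper's findings motivate.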

Limitations & Future Work

  • Schema rigidity: Existing taxonomies struggled to capture multi‑intent, composite prompts, leading to noisy classifications.
  • Contextual nuance: The study treated each message in isolation; real‑world conversations are multi‑turn, and meaning often emerges across exchanges.
  • Generalizability: Datasets were limited to two academic domains; results may differ in other subjects or with different chatbot interfaces.
  • Future direction: The authors advocate for conversation‑analysis methods from discursive psychology to model the full dialogue flow, and for developing richer, multi‑label classification frameworks that reflect the layered nature of student queries.

Authors

  • Alexandra Neagu
  • Marcus Messer
  • Peter Johnson
  • Rhodri Nelson

Paper Information

  • arXiv ID: 2602.18372v1
  • Categories: cs.HC, cs.AI
  • Published: February 20, 2026
  • PDF: Download PDF