[Paper] Decoding Human-LLM Collaboration in Coding: An Empirical Study of Multi-Turn Conversations in the Wild
Source: arXiv - 2512.10493v1
Overview
This paper investigates how developers actually interact with large language models (LLMs) when they use them as conversational coding assistants. By mining two massive real‑world conversation corpora—LMSYS‑Chat‑1M and WildChat—the authors map out the shapes of multi‑turn dialogues, measure how faithfully LLMs follow instructions, and gauge developer satisfaction across different coding tasks.
Key Contributions
- Interaction‑Pattern Taxonomy: Identifies three dominant dialogue structures—linear, star, and tree—and links each to specific coding task categories.
- Instruction‑Following Benchmark: Quantifies LLM compliance rates, revealing that bug‑fixing and refactoring requests trigger the highest non‑compliance.
- Satisfaction Analysis: Shows a clear split: tasks focused on code quality or strict requirements tend to leave users less satisfied, while knowledge‑query and algorithm‑design conversations score higher.
- Design Recommendations: Offers concrete UI/UX and prompting guidelines for building more effective LLM‑driven development tools.
- Open Empirical Dataset: Provides processed annotations (task type, dialogue shape, compliance flag, satisfaction score) that can serve as a benchmark for future research.
Methodology
- Data Collection: The study draws on two publicly available conversation corpora:
  - LMSYS‑Chat‑1M – a million anonymized user‑LLM conversations.
  - WildChat – a large corpus of real user–ChatGPT conversations collected in the wild, from which coding‑related dialogues are extracted.
- Task Classification: Each conversation is manually labeled into high‑level coding intents (e.g., bug fixing, code generation, algorithm design, knowledge query).
- Dialogue Shape Extraction: By tracking turn‑taking and reference links (e.g., “based on my previous snippet”), the authors map each chat onto one of three graph structures (a minimal classification sketch follows this list):
  - Linear: A single thread of back‑and‑forth.
  - Star: One central user request with many independent LLM replies.
  - Tree: Branching sub‑conversations (e.g., exploring design alternatives).
- Compliance Scoring: An LLM‑based evaluator checks whether the assistant’s response adheres to the explicit instruction (e.g., “only change this function”).
- User Satisfaction Proxy: Sentiment analysis on post‑chat feedback combined with explicit rating fields where available.
- Statistical Analysis: Chi‑square tests and logistic regression are used to relate task type, dialogue shape, compliance, and satisfaction.
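To make the dialogue‑shape extraction step concrete, the sketch below classifies a conversation as linear, star, or tree from its reply links. The `parent_of` mapping (which earlier turn each user turn refers back to) and the classification rules are simplifying assumptions for illustration; the paper derives such links from textual references like “based on my previous snippet” rather than from this exact structure.

```python
from collections import Counter

def classify_shape(parent_of):
    """Classify a conversation graph as 'linear', 'star', or 'tree'.

    parent_of maps each follow-up turn index to the index of the earlier
    turn it refers back to (0 = the opening request). This reply-link
    structure is an assumed input, not the paper's exact representation.
    """
    if not parent_of:
        return "linear"  # a single exchange has no branching

    children = Counter(parent_of.values())

    # Linear: every turn extends the immediately preceding one.
    if all(parent == turn - 1 for turn, parent in parent_of.items()):
        return "linear"

    # Star: all follow-ups hang off the opening request independently.
    if set(children) == {0}:
        return "star"

    # Anything with deeper branching is treated as a tree.
    return "tree"


# Turns 1 and 2 both refine the opening request, turn 3 branches off
# turn 1 -> a tree-shaped conversation.
print(classify_shape({1: 0, 2: 0, 3: 1}))  # tree
print(classify_shape({1: 0, 2: 1, 3: 2}))  # linear
```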
Results & Findings
| Aspect | What the Data Shows |
|---|---|
| Dialogue Shape vs. Task | • Code quality optimization → predominantly linear (step‑by‑step refinement). • Design‑driven tasks (architecture, API design) → tree structures (multiple branches of alternatives). • Pure queries (e.g., “how does quicksort work?”) → star patterns (one question, many concise answers). |
| Instruction Following | • Overall compliance ≈ 84 %. • Bug fixing and refactoring drop to ~70 % compliance, the lowest among categories. • Simple information retrieval stays above 90 %. |
| User Satisfaction | • Highest scores for structured knowledge queries and algorithm design (average rating 4.3/5). • Lowest for code quality optimization and requirements‑driven development (average rating 3.1/5). |
| Cross‑Effect | Linear dialogues tend to have higher compliance but lower satisfaction when the underlying task is quality‑focused, suggesting that “getting the answer” ≠ “being happy with the result.” |
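The relationships in the table above were tested with chi‑square tests and logistic regression (see Methodology). The snippet below is a minimal sketch of how such tests could be run on the released annotations, assuming a hypothetical CSV with columns `task_type`, `shape`, and `compliant`; the actual file layout and column names are not specified here.

```python
import pandas as pd
from scipy.stats import chi2_contingency
import statsmodels.formula.api as smf

# Hypothetical annotation table; real file and column names may differ.
df = pd.read_csv("annotations.csv")
df["compliant"] = df["compliant"].astype(int)

# Chi-square test: is dialogue shape associated with task type?
table = pd.crosstab(df["task_type"], df["shape"])
chi2, p, dof, _ = chi2_contingency(table)
print(f"shape vs. task: chi2={chi2:.1f}, dof={dof}, p={p:.3g}")

# Logistic regression: does compliance depend on task type and shape?
model = smf.logit("compliant ~ C(task_type) + C(shape)", data=df).fit()
print(model.summary())
```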
Practical Implications
Tool Designers
- Adaptive UI: Detect the emerging dialogue shape early (e.g., a tree pattern) and surface UI affordances like “branch selector” or “compare alternatives” to keep the conversation organized.
- Prompt Templates: For bug‑fixing and refactoring, prepend explicit scaffolding (“Please list the exact lines you will modify”) to boost compliance.
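A sketch of what such scaffolding could look like in practice; the template wording and the `scaffolded_prompt` helper are illustrative, not taken from the paper.

```python
SCAFFOLD = (
    "Before changing any code:\n"
    "1. List the exact lines (or functions) you will modify and why.\n"
    "2. Do not touch anything outside that list.\n"
    "3. After the patch, restate which of my constraints you satisfied.\n\n"
)

def scaffolded_prompt(user_request: str, task_type: str) -> str:
    """Prepend compliance scaffolding for the task types the study found
    to have the lowest compliance (bug fixing and refactoring)."""
    if task_type in {"bug_fixing", "refactoring"}:
        return SCAFFOLD + user_request
    return user_request

print(scaffolded_prompt("Fix the off-by-one error in parse_range().", "bug_fixing"))
```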
LLM Developers
- Fine‑Tuning Targets: Prioritize datasets that contain multi‑turn refactoring dialogues to close the compliance gap.
- Safety Nets: Implement a “re‑ask” fallback that automatically verifies whether the assistant’s edit matches the user’s constraint.
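A minimal sketch of such a “re‑ask” fallback, assuming a placeholder `call_llm` function that returns the answer text plus a per‑function diff; the scope check and retry wording are illustrative, not the paper’s mechanism.

```python
def edits_outside_scope(diff_by_function, allowed_functions):
    """Return changed functions the user did not allow.

    diff_by_function maps function names to their changed lines; a real
    system would build this by parsing a unified diff, which is out of
    scope here.
    """
    return set(diff_by_function) - set(allowed_functions)


def reask_fallback(call_llm, request, allowed_functions, max_retries=1):
    """Ask the model, verify the edit matches the user's constraint,
    and re-ask once with the violation spelled out if it does not.

    call_llm is a placeholder for the actual model API and must return
    (answer_text, diff_by_function).
    """
    answer, diff = call_llm(request)
    for _ in range(max_retries):
        violations = edits_outside_scope(diff, allowed_functions)
        if not violations:
            break
        request = (
            f"{request}\n\nYour previous edit also changed {sorted(violations)}, "
            f"but only {sorted(allowed_functions)} may be modified. Please redo the edit."
        )
        answer, diff = call_llm(request)
    return answer, diff
```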
DevOps & CI Integration
- Use the linear pattern as a natural fit for automated code‑review bots that iteratively improve a snippet, while tree‑style dialogues can feed into design‑review pipelines that need multiple proposals.
Metrics & Monitoring
- Track compliance and satisfaction per task type in production to surface pain points early (e.g., a spike in non‑compliance for refactoring may indicate a model regression).
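One way this kind of monitoring could be wired up, assuming each completed conversation is logged with a task type, a compliance flag, and a satisfaction score; the class name and alert threshold are illustrative choices, not part of the study.

```python
from collections import defaultdict

class AssistantMetrics:
    """Track compliance and satisfaction per task type so regressions
    (e.g., a drop in refactoring compliance) surface quickly."""

    def __init__(self, compliance_alert_threshold=0.75):
        self.records = defaultdict(list)  # task_type -> [(compliant, satisfaction)]
        self.threshold = compliance_alert_threshold

    def log(self, task_type, compliant, satisfaction):
        self.records[task_type].append((compliant, satisfaction))

    def report(self):
        summary = {}
        for task, rows in self.records.items():
            compliance = sum(c for c, _ in rows) / len(rows)
            satisfaction = sum(s for _, s in rows) / len(rows)
            summary[task] = {
                "compliance": round(compliance, 2),
                "mean_satisfaction": round(satisfaction, 2),
                "alert": compliance < self.threshold,  # illustrative threshold
            }
        return summary


metrics = AssistantMetrics()
metrics.log("refactoring", compliant=False, satisfaction=2.5)
metrics.log("knowledge_query", compliant=True, satisfaction=4.5)
print(metrics.report())
```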
Limitations & Future Work
- Dataset Bias: Both corpora are skewed toward English‑speaking developers and may under‑represent niche languages or domain‑specific tooling.
- Compliance Evaluation: The automated evaluator, while high‑performing, can misclassify nuanced instructions (e.g., “prefer readability over speed”). Human validation was limited to a sample.
- Satisfaction Proxy: Sentiment analysis on free‑text feedback is an imperfect stand‑in for true user experience; explicit Likert‑scale surveys would be more reliable.
Future Directions
- Extend the taxonomy to include mixed dialogue shapes and real‑time shape detection.
- Explore reinforcement‑learning‑based adapters that dynamically adjust prompting strategies based on detected task type.
- Conduct longitudinal user studies to see how satisfaction evolves as developers become more accustomed to LLM assistants.
Authors
- Binquan Zhang
- Li Zhang
- Haoyuan Zhang
- Fang Liu
- Song Wang
- Bo Shen
- An Fu
- Lin Shi
Paper Information
- arXiv ID: 2512.10493v1
- Categories: cs.SE
- Published: December 11, 2025