[Paper] Decoding Human-LLM Collaboration in Coding: An Empirical Study of Multi-Turn Conversations in the Wild
Source: arXiv - 2512.10493v1
Overview
This paper investigates how developers actually interact with large language models (LLMs) when they use them as conversational coding assistants. By mining two massive real‑world conversation corpora—LMSYS‑Chat‑1M and WildChat—the authors map out the shapes of multi‑turn dialogues, measure how faithfully LLMs follow instructions, and gauge developer satisfaction across different coding tasks.
Key Contributions
- Interaction‑Pattern Taxonomy: Identifies three dominant dialogue structures—linear, star, and tree—and links each to specific coding task categories.
- Instruction‑Following Benchmark: Quantifies LLM compliance rates, revealing that bug‑fixing and refactoring requests trigger the highest non‑compliance.
- Satisfaction Analysis: Shows a clear split: tasks focused on code quality or strict requirements tend to leave users less satisfied, while knowledge‑query and algorithm‑design conversations score higher.
- Design Recommendations: Offers concrete UI/UX and prompting guidelines for building more effective LLM‑driven development tools.
- Open Empirical Dataset: Provides processed annotations (task type, dialogue shape, compliance flag, satisfaction score) that can serve as a benchmark for future research.
Methodology
- Data Collection: The study draws on two publicly available conversation corpora:
  - LMSYS‑Chat‑1M – a million anonymized user‑LLM conversations.
  - WildChat – a large corpus of real user–ChatGPT conversations collected in the wild, from which coding‑related dialogues are extracted.
- Task Classification: Each conversation is manually labeled into high‑level coding intents (e.g., bug fixing, code generation, algorithm design, knowledge query).
- Dialogue Shape Extraction: By tracking turn‑taking and reference links (e.g., “based on my previous snippet”), the authors map each chat onto one of three graph structures (a minimal classification sketch follows this list):
  - Linear: A single thread of back‑and‑forth.
  - Star: One central user request with many independent LLM replies.
  - Tree: Branching sub‑conversations (e.g., exploring design alternatives).
- Compliance Scoring: An LLM‑based evaluator checks whether the assistant’s response adheres to the explicit instruction (e.g., “only change this function”).
- User Satisfaction Proxy: Sentiment analysis on post‑chat feedback combined with explicit rating fields where available.
- Statistical Analysis: Chi‑square tests and logistic regression are used to relate task type, dialogue shape, compliance, and satisfaction.
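To make the dialogue‑shape extraction step concrete, the sketch below classifies a conversation as linear, star, or tree from its reply links. The `parent_of` mapping (which earlier turn each user turn refers back to) and the classification rules are simplifying assumptions for illustration; the paper derives such links from textual references like “based on my previous snippet” rather than from this exact structure.

```python
from collections import Counter

def classify_shape(parent_of):
    """Classify a conversation graph as 'linear', 'star', or 'tree'.

    parent_of maps each follow-up turn index to the index of the earlier
    turn it refers back to (0 = the opening request). This reply-link
    structure is an assumed input, not the paper's exact representation.
    """
    if not parent_of:
        return "linear"  # a single exchange has no branching

    children = Counter(parent_of.values())

    # Linear: every turn extends the immediately preceding one.
    if all(parent == turn - 1 for turn, parent in parent_of.items()):
        return "linear"

    # Star: all follow-ups hang off the opening request independently.
    if set(children) == {0}:
        return "star"

    # Anything with deeper branching is treated as a tree.
    return "tree"


# Turns 1 and 2 both refine the opening request, turn 3 branches off
# turn 1 -> a tree-shaped conversation.
print(classify_shape({1: 0, 2: 0, 3: 1}))  # tree
print(classify_shape({1: 0, 2: 1, 3: 2}))  # linear
```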
Results & Findings
| Aspect | What the Data Shows |
|---|---|
| Dialogue Shape vs. Task | • Code quality optimization → predominantly linear (step‑by‑step refinement). • Design‑driven tasks (architecture, API design) → tree structures (multiple branches of alternatives). • Pure queries (e.g., “how does quicksort work?”) → star patterns (one question, many concise answers). |
| Instruction Following | • Overall compliance ≈ 84 %. • Bug fixing and refactoring drop to ~70 % compliance, the lowest among categories. • Simple information retrieval stays above 90 %. |
| User Satisfaction | • Highest scores for structured knowledge queries and algorithm design (average rating 4.3/5). • Lowest for code quality optimization and requirements‑driven development (average rating 3.1/5). |
| Cross‑Effect | Linear dialogues tend to have higher compliance but lower satisfaction when the underlying task is quality‑focused, suggesting that “getting the answer” ≠ “being happy with the result.” |
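The relationships in the table above were tested with chi‑square tests and logistic regression (see Methodology). The snippet below is a minimal sketch of how such tests could be run on the released annotations, assuming a hypothetical CSV with columns `task_type`, `shape`, and `compliant`; the actual file layout and column names are not specified here.

```python
import pandas as pd
from scipy.stats import chi2_contingency
import statsmodels.formula.api as smf

# Hypothetical annotation table; real file and column names may differ.
df = pd.read_csv("annotations.csv")
df["compliant"] = df["compliant"].astype(int)

# Chi-square test: is dialogue shape associated with task type?
table = pd.crosstab(df["task_type"], df["shape"])
chi2, p, dof, _ = chi2_contingency(table)
print(f"shape vs. task: chi2={chi2:.1f}, dof={dof}, p={p:.3g}")

# Logistic regression: does compliance depend on task type and shape?
model = smf.logit("compliant ~ C(task_type) + C(shape)", data=df).fit()
print(model.summary())
```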
Practical Implications
Tool Designers
- Adaptive UI: Detect the emerging dialogue shape early (e.g., a tree pattern) and surface UI affordances like “branch selector” or “compare alternatives” to keep the conversation organized.
- Prompt Templates: For bug‑fixing and refactoring, prepend explicit scaffolding (“Please list the exact lines you will modify”) to boost compliance.
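A sketch of what such scaffolding could look like in practice; the template wording and the `scaffolded_prompt` helper are illustrative, not taken from the paper.

```python
SCAFFOLD = (
    "Before changing any code:\n"
    "1. List the exact lines (or functions) you will modify and why.\n"
    "2. Do not touch anything outside that list.\n"
    "3. After the patch, restate which of my constraints you satisfied.\n\n"
)

def scaffolded_prompt(user_request: str, task_type: str) -> str:
    """Prepend compliance scaffolding for the task types the study found
    to have the lowest compliance (bug fixing and refactoring)."""
    if task_type in {"bug_fixing", "refactoring"}:
        return SCAFFOLD + user_request
    return user_request

print(scaffolded_prompt("Fix the off-by-one error in parse_range().", "bug_fixing"))
```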
LLM Developers
- Fine‑Tuning Targets: Prioritize datasets that contain multi‑turn refactoring dialogues to close the compliance gap.
- Safety Nets: Implement a “re‑ask” fallback that automatically verifies whether the assistant’s edit matches the user’s constraint.
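A minimal sketch of such a “re‑ask” fallback, assuming a placeholder `call_llm` function that returns the answer text plus a per‑function diff; the scope check and retry wording are illustrative, not the paper’s mechanism.

```python
def edits_outside_scope(diff_by_function, allowed_functions):
    """Return changed functions the user did not allow.

    diff_by_function maps function names to their changed lines; a real
    system would build this by parsing a unified diff, which is out of
    scope here.
    """
    return set(diff_by_function) - set(allowed_functions)


def reask_fallback(call_llm, request, allowed_functions, max_retries=1):
    """Ask the model, verify the edit matches the user's constraint,
    and re-ask once with the violation spelled out if it does not.

    call_llm is a placeholder for the actual model API and must return
    (answer_text, diff_by_function).
    """
    answer, diff = call_llm(request)
    for _ in range(max_retries):
        violations = edits_outside_scope(diff, allowed_functions)
        if not violations:
            break
        request = (
            f"{request}\n\nYour previous edit also changed {sorted(violations)}, "
            f"but only {sorted(allowed_functions)} may be modified. Please redo the edit."
        )
        answer, diff = call_llm(request)
    return answer, diff
```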
DevOps & CI Integration
- Use the linear pattern as a natural fit for automated code‑review bots that iteratively improve a snippet, while tree‑style dialogues can feed into design‑review pipelines that need multiple proposals.
Metrics & Monitoring
- Track compliance and satisfaction per task type in production to surface pain points early (e.g., a spike in non‑compliance for refactoring may indicate a model regression).
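One way this kind of monitoring could be wired up, assuming each completed conversation is logged with a task type, a compliance flag, and a satisfaction score; the class name and alert threshold are illustrative choices, not part of the study.

```python
from collections import defaultdict

class AssistantMetrics:
    """Track compliance and satisfaction per task type so regressions
    (e.g., a drop in refactoring compliance) surface quickly."""

    def __init__(self, compliance_alert_threshold=0.75):
        self.records = defaultdict(list)  # task_type -> [(compliant, satisfaction)]
        self.threshold = compliance_alert_threshold

    def log(self, task_type, compliant, satisfaction):
        self.records[task_type].append((compliant, satisfaction))

    def report(self):
        summary = {}
        for task, rows in self.records.items():
            compliance = sum(c for c, _ in rows) / len(rows)
            satisfaction = sum(s for _, s in rows) / len(rows)
            summary[task] = {
                "compliance": round(compliance, 2),
                "mean_satisfaction": round(satisfaction, 2),
                "alert": compliance < self.threshold,  # illustrative threshold
            }
        return summary


metrics = AssistantMetrics()
metrics.log("refactoring", compliant=False, satisfaction=2.5)
metrics.log("knowledge_query", compliant=True, satisfaction=4.5)
print(metrics.report())
```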
Limitations & Future Work
- Dataset Bias: Both corpora are skewed toward English‑speaking developers and may under‑represent niche languages or domain‑specific tooling.
- Compliance Evaluation: The automated evaluator, while high‑performing, can misclassify nuanced instructions (e.g., “prefer readability over speed”). Human validation was limited to a sample.
- Satisfaction Proxy: Sentiment analysis on free‑text feedback is an imperfect stand‑in for true user experience; explicit Likert‑scale surveys would be more reliable.
Future Directions
- Extend the taxonomy to include mixed dialogue shapes and real‑time shape detection.
- Explore reinforcement‑learning‑based adapters that dynamically adjust prompting strategies based on detected task type.
- Conduct longitudinal user studies to see how satisfaction evolves as developers become more accustomed to LLM assistants.
Authors
- Binquan Zhang
- Li Zhang
- Haoyuan Zhang
- Fang Liu
- Song Wang
- Bo Shen
- An Fu
- Lin Shi
Paper Information
- arXiv ID: 2512.10493v1
- Categories: cs.SE
- Published: December 11, 2025