[Paper] An Empirical Study of Interaction Smells in Multi-Turn Human-LLM Collaborative Code Generation

Published: March 10, 2026

Source: arXiv - 2603.09701v1

Overview

This paper investigates why large language models (LLMs) that generate code often stumble during long, back‑and‑forth conversations with developers. The authors call these hidden problems Interaction Smells—subtle breakdowns in the dialogue that can derail a coding task even when the final code snippet looks correct. By cataloguing these smells, measuring how often they appear in popular LLMs, and proposing a lightweight fix, the work bridges a gap between academic benchmarks and the day‑to‑day experience of developers using AI‑assisted coding tools.

Key Contributions

  • First taxonomy of Interaction Smells for human‑LLM code generation, organized into three high‑level groups (User Intent Quality, Historical Instruction Compliance, Historical Response Violation) and nine concrete sub‑categories.
  • Large‑scale empirical analysis on real‑world multi‑turn chats from the WildChat and LMSYS‑Chat‑1M datasets, covering six widely used LLMs (GPT‑4o, DeepSeek‑Chat, Gemini 2.5, Qwen2.5‑32B/72B, Qwen3‑235B‑a22b).
  • Quantitative distribution report showing which models are more prone to specific smells and how the problem evolves over conversation length.
  • Invariant‑aware Constraint Evolution (InCE), a multi‑agent framework that extracts global “invariants” (e.g., naming conventions, API contracts) from the whole dialogue and runs a pre‑generation audit to catch violations before code is emitted.
  • Empirical validation on the extended WildBench benchmark, demonstrating a measurable lift in Task Success Rate and a strong reduction in the frequency of Interaction Smells with only a modest overhead.
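The shape of the taxonomy can be sketched as a small data model. The sub-category label and evidence string below are hypothetical illustrations, since this summary does not reproduce the paper's nine concrete smell names.

```python
from dataclasses import dataclass
from enum import Enum

class SmellGroup(Enum):
    """The three high-level groups in the paper's taxonomy."""
    USER_INTENT_QUALITY = "User Intent Quality"
    HISTORICAL_INSTRUCTION_COMPLIANCE = "Historical Instruction Compliance"
    HISTORICAL_RESPONSE_VIOLATION = "Historical Response Violation"

@dataclass
class SmellFlag:
    """One flagged smell occurrence in a multi-turn session."""
    turn: int            # 0-based index of the offending turn
    group: SmellGroup    # high-level taxonomy group
    subcategory: str     # one of the nine concrete smell types
    evidence: str        # dialogue snippet that triggered the flag

# Hypothetical example flag (labels invented for illustration):
flag = SmellFlag(
    turn=4,
    group=SmellGroup.HISTORICAL_INSTRUCTION_COMPLIANCE,
    subcategory="ignored_api_version",
    evidence="User pinned requests==2.31; model imported httpx instead",
)
```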

Methodology

  1. Data collection – The authors sampled multi‑turn programming sessions from two public corpora:

    • WildChat – real user‑LLM conversations collected in the wild via a free public chatbot service.
    • LMSYS‑Chat‑1M – a large, diverse set of human‑LLM conversations.
  2. Taxonomy building – Using open card‑sorting, three researchers manually inspected hundreds of interaction logs, iteratively clustering recurring issues until a stable set of nine smell types emerged.

  3. Model evaluation – Six LLMs were prompted with the same interaction histories. For each turn, the system automatically flagged violations according to the taxonomy (e.g., “forgot previous variable name”, “ignored user‑specified API version”).

  4. Mitigation design (InCE)

    • Invariant extraction: a dedicated agent parses the whole conversation to infer constraints that should stay constant (e.g., function signatures, library versions).
    • Constraint evolution: as the dialogue proceeds, the invariant set is updated but never contradicted.
    • Pre‑generation audit: before the main code‑generation model produces output, a lightweight verifier checks the proposed response against the invariant set, rejecting or prompting a rewrite if a smell is detected.
  5. Benchmarking – The enhanced pipeline was run on WildBench, a curated suite of multi‑turn coding tasks, measuring both functional correctness (Task Success Rate) and smell frequency.
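The InCE steps above can be sketched as a minimal pipeline. The regex-based `extract_invariants` and string-match `audit` below are toy stand-ins for the paper's LLM agents, included only to make the control flow concrete.

```python
import re

def extract_invariants(dialogue: list[str]) -> dict[str, str]:
    """Infer constraints that should stay constant across turns.

    A toy stand-in for the invariant-extraction agent: here we only
    capture explicit "use X version Y" statements via regex.
    """
    invariants = {}
    for turn in dialogue:
        for lib, ver in re.findall(r"use (\w+) version ([\w.]+)", turn):
            invariants[lib] = ver  # later mentions refine, never contradict
    return invariants

def audit(candidate_code: str, invariants: dict[str, str]) -> list[str]:
    """Pre-generation audit: report invariant violations before emitting code."""
    violations = []
    for lib, ver in invariants.items():
        pin = f"{lib}=={ver}"
        if lib in candidate_code and pin not in candidate_code:
            violations.append(f"{lib} pinned to {ver} earlier in the dialogue")
    return violations

dialogue = ["Please use requests version 2.31 for all HTTP calls.",
            "Now add retry logic."]
inv = extract_invariants(dialogue)
print(audit("import requests  # requests==2.28", inv))
```

In the full framework a detected violation would trigger a rewrite prompt rather than a rejection message; the sketch only shows the check itself.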

Results & Findings

Model             Baseline Task Success Rate   Success Rate with InCE   Smell Reduction
GPT‑4o            71.2 %                       78.5 %                   –28 %
DeepSeek‑Chat     64.8 %                       71.3 %                   –31 %
Gemini 2.5        68.0 %                       74.9 %                   –27 %
Qwen2.5‑32B       60.5 %                       66.8 %                   –30 %
Qwen2.5‑72B       62.1 %                       68.9 %                   –29 %
Qwen3‑235B‑a22b   73.4 %                       80.2 %                   –26 %
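For context, the absolute lift implied by the table works out to roughly 6–7 percentage points per model:

```python
# Task Success Rates (%) copied from the table above.
baseline = {"GPT-4o": 71.2, "DeepSeek-Chat": 64.8, "Gemini 2.5": 68.0,
            "Qwen2.5-32B": 60.5, "Qwen2.5-72B": 62.1, "Qwen3-235B-a22b": 73.4}
with_ince = {"GPT-4o": 78.5, "DeepSeek-Chat": 71.3, "Gemini 2.5": 74.9,
             "Qwen2.5-32B": 66.8, "Qwen2.5-72B": 68.9, "Qwen3-235B-a22b": 80.2}

# Per-model absolute gain in percentage points, and the average gain.
lifts = {m: round(with_ince[m] - baseline[m], 1) for m in baseline}
avg_lift = round(sum(lifts.values()) / len(lifts), 1)
print(lifts)
print(avg_lift)  # -> 6.8
```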

Key takeaways

  • Interaction Smells are common: every model exhibited at least one smell in ~40 % of turns, with “Historical Instruction Compliance” (ignoring earlier user constraints) being the most frequent.
  • Model size matters, but not decisively: the larger Qwen variants outperformed their smaller counterparts, yet the overall ranking did not track parameter count alone.
  • InCE is model‑agnostic: the same mitigation layer improved all six models, suggesting that the problem is largely orthogonal to raw language capability.
  • Overhead is low: the additional audit step added ~0.15 seconds per turn, negligible for most developer workflows.
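The "~40 % of turns" figure corresponds to a simple per-turn smell rate, which could be computed like this (the per-turn flag counts are assumed data, not from the paper):

```python
def smell_rate(flags_per_turn: list[int]) -> float:
    """Fraction of turns exhibiting at least one Interaction Smell."""
    return sum(1 for n in flags_per_turn if n > 0) / len(flags_per_turn)

# e.g. a 10-turn session where 4 turns were flagged:
print(smell_rate([0, 2, 0, 1, 0, 0, 3, 0, 1, 0]))  # -> 0.4
```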

Practical Implications

  • Better IDE assistants – Embedding an InCE‑style guard in VS Code or JetBrains plugins could catch “forgotten variable” or “wrong API version” errors before they surface, reducing debugging time.
  • More reliable CI pipelines – Automated code‑review bots that use LLMs can adopt invariant extraction to enforce project‑wide policies (e.g., coding style, dependency constraints) without manual rule‑writing.
  • Improved onboarding for junior devs – Learning tools that pair a student with an LLM can surface interaction smells as teaching moments, highlighting where the model lost context.
  • Vendor‑agnostic safety layer – Since InCE works across models, SaaS providers can add it on top of any backend LLM (OpenAI, Anthropic, etc.) to guarantee a baseline interaction quality.
  • Metrics for product teams – Interaction Smell rates give a more nuanced health indicator than “pass/fail unit tests,” helping product managers track conversational robustness over releases.

Limitations & Future Work

  • Taxonomy scope – The nine smell categories were derived from English‑centric, open‑source chats; niche domains (e.g., embedded C, GPU kernels) may expose additional patterns.
  • Static invariant extraction – Current agents infer invariants from explicit mentions; implicit constraints (e.g., performance budgets) remain undetected.
  • Evaluation on synthetic benchmarks – While WildBench is realistic, broader field studies (e.g., in‑company codebases) are needed to confirm external validity.
  • Scalability of multi‑agent orchestration – As conversation length grows to hundreds of turns, maintaining and checking a large invariant set could become costly; future work may explore hierarchical or learned constraint representations.
  • User experience – The audit step can occasionally reject a plausible answer, prompting an extra turn; designing smoother “explain‑and‑re‑ask” interactions is an open research direction.

Authors

  • Binquan Zhang
  • Li Zhang
  • Lin Shi
  • Song Wang
  • Yuwei Qian
  • Linhui Zhao
  • Fang Liu
  • An Fu
  • Yida Ye

Paper Information

  • arXiv ID: 2603.09701v1
  • Categories: cs.SE
  • Published: March 10, 2026