[Paper] An Empirical Study of Interaction Smells in Multi-Turn Human-LLM Collaborative Code Generation

Published: March 10, 2026

Source: arXiv - 2603.09701v1

Overview

This paper investigates why large language models (LLMs) that generate code often stumble during long, back‑and‑forth conversations with developers. The authors call these hidden problems Interaction Smells—subtle breakdowns in the dialogue that can derail a coding task even when the final code snippet looks correct. By cataloguing these smells, measuring how often they appear in popular LLMs, and proposing a lightweight fix, the work bridges a gap between academic benchmarks and the day‑to‑day experience of developers using AI‑assisted coding tools.

Key Contributions

  • First taxonomy of Interaction Smells for human‑LLM code generation, organized into three high‑level groups (User Intent Quality, Historical Instruction Compliance, Historical Response Violation) and nine concrete sub‑categories.
  • Large‑scale empirical analysis on real‑world multi‑turn chats from the WildChat and LMSYS‑Chat‑1M datasets, covering six widely used LLMs (GPT‑4o, DeepSeek‑Chat, Gemini 2.5, Qwen2.5‑32B/72B, Qwen3‑235B‑a22b).
  • Quantitative distribution report showing which models are more prone to specific smells and how the problem evolves over conversation length.
  • Invariant‑aware Constraint Evolution (InCE), a multi‑agent framework that extracts global “invariants” (e.g., naming conventions, API contracts) from the whole dialogue and runs a pre‑generation audit to catch violations before code is emitted.
  • Empirical validation on the extended WildBench benchmark, demonstrating a measurable lift in Task Success Rate and a strong reduction in the frequency of Interaction Smells with only a modest overhead.
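The shape of the taxonomy can be sketched as a small data model. The sub-category label and evidence string below are hypothetical illustrations, since this summary does not reproduce the paper's nine concrete smell names.

```python
from dataclasses import dataclass
from enum import Enum

class SmellGroup(Enum):
    """The three high-level groups in the paper's taxonomy."""
    USER_INTENT_QUALITY = "User Intent Quality"
    HISTORICAL_INSTRUCTION_COMPLIANCE = "Historical Instruction Compliance"
    HISTORICAL_RESPONSE_VIOLATION = "Historical Response Violation"

@dataclass
class SmellFlag:
    """One flagged smell occurrence in a multi-turn session."""
    turn: int            # 0-based index of the offending turn
    group: SmellGroup    # high-level taxonomy group
    subcategory: str     # one of the nine concrete smell types
    evidence: str        # dialogue snippet that triggered the flag

# Hypothetical example flag (labels invented for illustration):
flag = SmellFlag(
    turn=4,
    group=SmellGroup.HISTORICAL_INSTRUCTION_COMPLIANCE,
    subcategory="ignored_api_version",
    evidence="User pinned requests==2.31; model imported httpx instead",
)
```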

Methodology

  1. Data collection – The authors sampled multi‑turn programming sessions from two public corpora:

    • WildChat – real user‑LLM conversations collected in the wild via a free public chatbot service.
    • LMSYS‑Chat‑1M – a large, diverse set of human‑LLM conversations.
  2. Taxonomy building – Using open card‑sorting, three researchers manually inspected hundreds of interaction logs, iteratively clustering recurring issues until a stable set of nine smell types emerged.

  3. Model evaluation – Six LLMs were prompted with the same interaction histories. For each turn, the system automatically flagged violations according to the taxonomy (e.g., “forgot previous variable name”, “ignored user‑specified API version”).

  4. Mitigation design (InCE)

    • Invariant extraction: a dedicated agent parses the whole conversation to infer constraints that should stay constant (e.g., function signatures, library versions).
    • Constraint evolution: as the dialogue proceeds, the invariant set is updated but never contradicted.
    • Pre‑generation audit: before the main code‑generation model produces output, a lightweight verifier checks the proposed response against the invariant set, rejecting or prompting a rewrite if a smell is detected.
  5. Benchmarking – The enhanced pipeline was run on WildBench, a curated suite of multi‑turn coding tasks, measuring both functional correctness (Task Success Rate) and smell frequency.
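The InCE steps above can be sketched as a minimal pipeline. The regex-based `extract_invariants` and string-match `audit` below are toy stand-ins for the paper's LLM agents, included only to make the control flow concrete.

```python
import re

def extract_invariants(dialogue: list[str]) -> dict[str, str]:
    """Infer constraints that should stay constant across turns.

    A toy stand-in for the invariant-extraction agent: here we only
    capture explicit "use X version Y" statements via regex.
    """
    invariants = {}
    for turn in dialogue:
        for lib, ver in re.findall(r"use (\w+) version ([\w.]+)", turn):
            invariants[lib] = ver  # later mentions refine, never contradict
    return invariants

def audit(candidate_code: str, invariants: dict[str, str]) -> list[str]:
    """Pre-generation audit: report invariant violations before emitting code."""
    violations = []
    for lib, ver in invariants.items():
        pin = f"{lib}=={ver}"
        if lib in candidate_code and pin not in candidate_code:
            violations.append(f"{lib} pinned to {ver} earlier in the dialogue")
    return violations

dialogue = ["Please use requests version 2.31 for all HTTP calls.",
            "Now add retry logic."]
inv = extract_invariants(dialogue)
print(audit("import requests  # requests==2.28", inv))
```

In the full framework a detected violation would trigger a rewrite prompt rather than a rejection message; the sketch only shows the check itself.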

Results & Findings

Model             Baseline Task Success Rate   Success Rate with InCE   Smell Reduction
GPT‑4o            71.2 %                       78.5 %                   –28 %
DeepSeek‑Chat     64.8 %                       71.3 %                   –31 %
Gemini 2.5        68.0 %                       74.9 %                   –27 %
Qwen2.5‑32B       60.5 %                       66.8 %                   –30 %
Qwen2.5‑72B       62.1 %                       68.9 %                   –29 %
Qwen3‑235B‑a22b   73.4 %                       80.2 %                   –26 %
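For context, the absolute lift implied by the table works out to roughly 6–7 percentage points per model:

```python
# Task Success Rates (%) copied from the table above.
baseline = {"GPT-4o": 71.2, "DeepSeek-Chat": 64.8, "Gemini 2.5": 68.0,
            "Qwen2.5-32B": 60.5, "Qwen2.5-72B": 62.1, "Qwen3-235B-a22b": 73.4}
with_ince = {"GPT-4o": 78.5, "DeepSeek-Chat": 71.3, "Gemini 2.5": 74.9,
             "Qwen2.5-32B": 66.8, "Qwen2.5-72B": 68.9, "Qwen3-235B-a22b": 80.2}

# Per-model absolute gain in percentage points, and the average gain.
lifts = {m: round(with_ince[m] - baseline[m], 1) for m in baseline}
avg_lift = round(sum(lifts.values()) / len(lifts), 1)
print(lifts)
print(avg_lift)  # -> 6.8
```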

Key takeaways

  • Interaction Smells are common: every model exhibited at least one smell in ~40 % of turns, with “Historical Instruction Compliance” (ignoring earlier user constraints) being the most frequent.
  • Model size matters, but not decisively: the larger Qwen variants outperformed their smaller counterparts, yet the overall ranking did not track parameter count alone.
  • InCE is model‑agnostic: the same mitigation layer improved all six models, suggesting that the problem is largely orthogonal to raw language capability.
  • Overhead is low: the additional audit step added ~0.15 seconds per turn, negligible for most developer workflows.
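The "~40 % of turns" figure corresponds to a simple per-turn smell rate, which could be computed like this (the per-turn flag counts are assumed data, not from the paper):

```python
def smell_rate(flags_per_turn: list[int]) -> float:
    """Fraction of turns exhibiting at least one Interaction Smell."""
    return sum(1 for n in flags_per_turn if n > 0) / len(flags_per_turn)

# e.g. a 10-turn session where 4 turns were flagged:
print(smell_rate([0, 2, 0, 1, 0, 0, 3, 0, 1, 0]))  # -> 0.4
```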

Practical Implications

  • Better IDE assistants – Embedding an InCE‑style guard in VS Code or JetBrains plugins could catch “forgotten variable” or “wrong API version” errors before they surface, reducing debugging time.
  • More reliable CI pipelines – Automated code‑review bots that use LLMs can adopt invariant extraction to enforce project‑wide policies (e.g., coding style, dependency constraints) without manual rule‑writing.
  • Improved onboarding for junior devs – Learning tools that pair a student with an LLM can surface interaction smells as teaching moments, highlighting where the model lost context.
  • Vendor‑agnostic safety layer – Since InCE works across models, SaaS providers can add it on top of any backend LLM (OpenAI, Anthropic, etc.) to guarantee a baseline interaction quality.
  • Metrics for product teams – Interaction Smell rates give a more nuanced health indicator than “pass/fail unit tests,” helping product managers track conversational robustness over releases.

Limitations & Future Work

  • Taxonomy scope – The nine smell categories were derived from English‑centric, open‑source chats; niche domains (e.g., embedded C, GPU kernels) may expose additional patterns.
  • Static invariant extraction – Current agents infer invariants from explicit mentions; implicit constraints (e.g., performance budgets) remain undetected.
  • Evaluation on synthetic benchmarks – While WildBench is realistic, broader field studies (e.g., in‑company codebases) are needed to confirm external validity.
  • Scalability of multi‑agent orchestration – As conversation length grows to hundreds of turns, maintaining and checking a large invariant set could become costly; future work may explore hierarchical or learned constraint representations.
  • User experience – The audit step can occasionally reject a plausible answer, prompting an extra turn; designing smoother “explain‑and‑re‑ask” interactions is an open research direction.

Authors

  • Binquan Zhang
  • Li Zhang
  • Lin Shi
  • Song Wang
  • Yuwei Qian
  • Linhui Zhao
  • Fang Liu
  • An Fu
  • Yida Ye

Paper Information

  • arXiv ID: 2603.09701v1
  • Categories: cs.SE
  • Published: March 10, 2026