[Paper] SWE-chat: Coding Agent Interactions From Real Users in the Wild

Published: April 22, 2026 at 01:08 PM EDT
4 min read
Source: arXiv

Overview

The paper introduces SWE‑chat, the first large‑scale, continuously‑updated dataset of real‑world interactions between developers and AI coding agents. By harvesting thousands of open‑source sessions straight from public repositories, the authors give the community concrete evidence about how these assistants are actually used—and where they still fall short.

Key Contributions

  • A living dataset of ≈6 K coding‑agent sessions (≈63 K user prompts, ≈355 K agent tool calls) that is automatically refreshed from public codebases.
  • Authorship attribution for every line of code, enabling precise measurement of how much code is produced by the AI vs. the human.
  • Empirical characterization of usage patterns, revealing a bimodal “vibe‑coding” behavior where agents write almost all committed code in 41 % of sessions.
  • Failure‑mode analysis, showing that only 44 % of AI‑generated code survives to commit and that AI‑written code carries more security vulnerabilities.
  • Interaction dynamics metrics, quantifying how often developers interrupt, correct, or reject agent outputs (44 % of turns).

Methodology

  1. Data collection pipeline – The authors built a scraper that continuously monitors public GitHub repositories for files matching the typical structure of AI‑assistant logs (e.g., *.swechat.json).
  2. Session reconstruction – Raw logs are parsed into turn‑by‑turn dialogues, linking each user prompt to the subsequent agent tool calls (e.g., search, edit, run).
  3. Authorship labeling – By tracking which turn generated each diff, the system tags every added line as human‑authored or agent‑authored.
  4. Static analysis – All committed code is run through security scanners (e.g., CodeQL) to compare vulnerability rates between human‑ and AI‑written snippets.
  5. Statistical analysis – The team computes distributions of session length, tool‑call frequency, and “survival” rates (how often AI‑generated code makes it into the final commit).

The pipeline is open‑source, so the dataset can keep growing as more developers adopt AI assistants.
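The session‑reconstruction and authorship‑labeling steps can be sketched in miniature. This is a minimal illustration, not the authors' code: the log schema below (a JSON list of turns with a `role` field and per‑tool‑call `diff` entries) is an assumption for demonstration, since the paper's actual log format is not reproduced here.

```python
import json

def reconstruct_session(log_text):
    """Parse a raw log into per-prompt turns and tag added lines by author.

    Hypothetical schema: each entry has a "role" ("user" or "agent");
    agent entries carry "tool_calls", each with a "name" and a "diff"
    listing the lines it added.
    """
    entries = json.loads(log_text)
    turns = []          # one dict per user prompt
    authored = {}       # added line -> "agent"
    for entry in entries:
        if entry["role"] == "user":
            turns.append({"prompt": entry["text"], "tool_calls": []})
        elif entry["role"] == "agent" and turns:
            for call in entry.get("tool_calls", []):
                turns[-1]["tool_calls"].append(call["name"])
                for line in call.get("diff", {}).get("added", []):
                    authored[line] = "agent"
    return turns, authored

# Tiny synthetic session: one prompt, one agent edit.
log = json.dumps([
    {"role": "user", "text": "add a retry helper"},
    {"role": "agent", "tool_calls": [
        {"name": "edit", "diff": {"added": ["def retry(f): ..."]}},
    ]},
])
turns, authored = reconstruct_session(log)
print(len(turns), authored)  # 1 {'def retry(f): ...': 'agent'}
```

Lines absent from `authored` would then default to human‑authored, which is the binary labeling the paper's attribution step needs before static analysis.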

Results & Findings

  • Session coding style – 41 % of sessions are “vibe coding” (agents write almost all committed code); 23 % are fully human‑written.
  • Code survival – Only 44 % of AI‑produced lines survive to the final commit; the rest are edited or discarded.
  • Security – AI‑written code exhibits a higher vulnerability density (≈1.8× more issues per LOC) than human code.
  • Developer push‑back – In 44 % of dialogue turns, developers intervene by correcting, reporting failures, or aborting the agent’s suggestion.
  • Tool‑call volume – Agents make an average of ≈60 tool calls per session, indicating heavy reliance on external actions (search, test, refactor).

These numbers paint a nuanced picture: while AI assistants can take the lead in many projects, they are still far from autonomous, and developers spend a lot of effort curating the output.

Practical Implications

  • Tool designers should prioritize guardrails—e.g., automatic security linting of AI‑generated patches—to mitigate the higher vulnerability risk.
  • IDE integrations can surface survival‑rate metrics in real time, warning developers when the assistant’s suggestions are frequently rejected.
  • Workflow automation can be tuned to reduce unnecessary tool calls; the high call volume suggests many “trial‑and‑error” loops that could be streamlined.
  • Team leads may adopt policies that require human review of any AI‑generated commit, especially for security‑critical modules.
  • Benchmarking research should move beyond synthetic tasks and evaluate agents on datasets like SWE‑chat that reflect true developer behavior.
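A survival‑rate metric of the kind an IDE integration might surface can be sketched simply. This is an illustrative approximation, not the paper's measurement method: it exact‑matches lines between the agent's output and the final commit, whereas real attribution would track lines through intermediate edits.

```python
def survival_rate(agent_lines, final_commit_lines):
    """Fraction of agent-written lines that survive to the final commit.

    Set-based approximation: a line "survives" only if it appears
    verbatim in the final commit.
    """
    agent = set(agent_lines)
    if not agent:
        return None  # no agent code in this session to measure
    surviving = agent & set(final_commit_lines)
    return len(surviving) / len(agent)

rate = survival_rate(
    ["a = 1", "b = 2", "c = 3", "d = 4"],   # lines the agent wrote
    ["a = 1", "b = 2", "x = 9"],            # lines in the final commit
)
print(rate)  # 0.5
```

A tool could warn the developer when this rate drops well below the paper's observed 44 % average, signaling that the assistant's suggestions are being heavily reworked.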

Limitations & Future Work

  • Dataset bias: The collection focuses on open‑source repositories that expose agent logs, potentially under‑representing private or enterprise usage patterns.
  • Tool‑call granularity: Some agents bundle multiple actions into a single call, making it harder to attribute fine‑grained effort.
  • Security analysis scope: Static scanners catch many issues but may miss runtime‑only vulnerabilities; deeper dynamic analysis is a next step.
  • Long‑term evolution: As coding agents improve, the bimodal usage pattern may shift; continuous monitoring will be needed to track trends.

The authors plan to expand SWE‑chat to cover more languages, incorporate runtime performance data, and open up a leaderboard for real‑world agent evaluation.

Authors

  • Joachim Baumann
  • Vishakh Padmakumar
  • Xiang Li
  • John Yang
  • Diyi Yang
  • Sanmi Koyejo

Paper Information

  • arXiv ID: 2604.20779v1
  • Categories: cs.AI, cs.CY, cs.SE
  • Published: April 22, 2026
  • PDF: Download PDF