[Paper] SWE-chat: Coding Agent Interactions From Real Users in the Wild

Published: April 22, 2026 at 01:08 PM EDT
4 min read
Source: arXiv

Overview

The paper introduces SWE‑chat, the first large‑scale, continuously‑updated dataset of real‑world interactions between developers and AI coding agents. By harvesting thousands of open‑source sessions straight from public repositories, the authors give the community concrete evidence about how these assistants are actually used—and where they still fall short.

Key Contributions

  • A living dataset of ≈6 K coding‑agent sessions (≈63 K user prompts, ≈355 K agent tool calls) that is automatically refreshed from public codebases.
  • Authorship attribution for every line of code, enabling precise measurement of how much code is produced by the AI vs. the human.
  • Empirical characterization of usage patterns, revealing a bimodal “vibe‑coding” behavior where agents write almost all committed code in 41 % of sessions.
  • Failure‑mode analysis, showing that only 44 % of AI‑generated code survives to commit and that AI‑written code carries more security vulnerabilities.
  • Interaction dynamics metrics, quantifying how often developers interrupt, correct, or reject agent outputs (44 % of turns).

Methodology

  1. Data collection pipeline – The authors built a scraper that continuously monitors public GitHub repositories for files matching the typical structure of AI‑assistant logs (e.g., *.swechat.json).
  2. Session reconstruction – Raw logs are parsed into turn‑by‑turn dialogues, linking each user prompt to the subsequent agent tool calls (e.g., search, edit, run).
  3. Authorship labeling – By tracking which turn generated each diff, the system tags every added line as human‑authored or agent‑authored.
  4. Static analysis – All committed code is run through security scanners (e.g., CodeQL) to compare vulnerability rates between human‑ and AI‑written snippets.
  5. Statistical analysis – The team computes distributions of session length, tool‑call frequency, and “survival” rates (how often AI‑generated code makes it into the final commit).

The pipeline is open‑source, so the dataset can keep growing as more developers adopt AI assistants.
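The session‑reconstruction and authorship‑labeling steps can be sketched in miniature. This is a minimal illustration, not the authors' code: the log schema below (a JSON list of turns with a `role` field and per‑tool‑call `diff` entries) is an assumption for demonstration, since the paper's actual log format is not reproduced here.

```python
import json

def reconstruct_session(log_text):
    """Parse a raw log into per-prompt turns and tag added lines by author.

    Hypothetical schema: each entry has a "role" ("user" or "agent");
    agent entries carry "tool_calls", each with a "name" and a "diff"
    listing the lines it added.
    """
    entries = json.loads(log_text)
    turns = []          # one dict per user prompt
    authored = {}       # added line -> "agent"
    for entry in entries:
        if entry["role"] == "user":
            turns.append({"prompt": entry["text"], "tool_calls": []})
        elif entry["role"] == "agent" and turns:
            for call in entry.get("tool_calls", []):
                turns[-1]["tool_calls"].append(call["name"])
                for line in call.get("diff", {}).get("added", []):
                    authored[line] = "agent"
    return turns, authored

# Tiny synthetic session: one prompt, one agent edit.
log = json.dumps([
    {"role": "user", "text": "add a retry helper"},
    {"role": "agent", "tool_calls": [
        {"name": "edit", "diff": {"added": ["def retry(f): ..."]}},
    ]},
])
turns, authored = reconstruct_session(log)
print(len(turns), authored)  # 1 {'def retry(f): ...': 'agent'}
```

Lines absent from `authored` would then default to human‑authored, which is the binary labeling the paper's attribution step needs before static analysis.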

Results & Findings

  • Session coding style – 41 % of sessions are “vibe coding” (agents write almost all committed code); 23 % are fully human‑written.
  • Code survival – Only 44 % of AI‑produced lines survive to the final commit; the rest are edited or discarded.
  • Security – AI‑written code exhibits a higher vulnerability density (≈1.8× more issues per LOC) than human code.
  • Developer push‑back – In 44 % of dialogue turns, developers intervene by correcting, reporting failures, or aborting the agent’s suggestion.
  • Tool‑call volume – Agents make an average of ≈60 tool calls per session, indicating heavy reliance on external actions (search, test, refactor).

These numbers paint a nuanced picture: while AI assistants can take the lead in many projects, they are still far from autonomous, and developers spend a lot of effort curating the output.

Practical Implications

  • Tool designers should prioritize guardrails—e.g., automatic security linting of AI‑generated patches—to mitigate the higher vulnerability risk.
  • IDE integrations can surface survival‑rate metrics in real time, warning developers when the assistant’s suggestions are frequently rejected.
  • Workflow automation can be tuned to reduce unnecessary tool calls; the high call volume suggests many “trial‑and‑error” loops that could be streamlined.
  • Team leads may adopt policies that require human review of any AI‑generated commit, especially for security‑critical modules.
  • Benchmarking research should move beyond synthetic tasks and evaluate agents on datasets like SWE‑chat that reflect true developer behavior.
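A survival‑rate metric of the kind an IDE integration might surface can be sketched simply. This is an illustrative approximation, not the paper's measurement method: it exact‑matches lines between the agent's output and the final commit, whereas real attribution would track lines through intermediate edits.

```python
def survival_rate(agent_lines, final_commit_lines):
    """Fraction of agent-written lines that survive to the final commit.

    Set-based approximation: a line "survives" only if it appears
    verbatim in the final commit.
    """
    agent = set(agent_lines)
    if not agent:
        return None  # no agent code in this session to measure
    surviving = agent & set(final_commit_lines)
    return len(surviving) / len(agent)

rate = survival_rate(
    ["a = 1", "b = 2", "c = 3", "d = 4"],   # lines the agent wrote
    ["a = 1", "b = 2", "x = 9"],            # lines in the final commit
)
print(rate)  # 0.5
```

A tool could warn the developer when this rate drops well below the paper's observed 44 % average, signaling that the assistant's suggestions are being heavily reworked.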

Limitations & Future Work

  • Dataset bias: The collection focuses on open‑source repositories that expose agent logs, potentially under‑representing private or enterprise usage patterns.
  • Tool‑call granularity: Some agents bundle multiple actions into a single call, making it harder to attribute fine‑grained effort.
  • Security analysis scope: Static scanners catch many issues but may miss runtime‑only vulnerabilities; deeper dynamic analysis is a next step.
  • Long‑term evolution: As coding agents improve, the bimodal usage pattern may shift; continuous monitoring will be needed to track trends.

The authors plan to expand SWE‑chat to cover more languages, incorporate runtime performance data, and open up a leaderboard for real‑world agent evaluation.

Authors

  • Joachim Baumann
  • Vishakh Padmakumar
  • Xiang Li
  • John Yang
  • Diyi Yang
  • Sanmi Koyejo

Paper Information

  • arXiv ID: 2604.20779v1
  • Categories: cs.AI, cs.CY, cs.SE
  • Published: April 22, 2026
  • PDF: Download PDF