[Paper] An Investigation on How AI-Generated Responses Affect Software Engineering Surveys
Source: arXiv - 2512.17455v1
Overview
Survey research is a cornerstone of empirical software‑engineering studies, but the rise of large language models (LLMs) such as ChatGPT is opening a new attack surface: participants can now generate “plausible” answers with a few keystrokes. This paper investigates how AI‑generated responses are already contaminating real‑world SE surveys and what that means for the credibility of the data we rely on.
Key Contributions
- Empirical evidence of AI misuse – Detects 49 AI‑crafted responses across two 2025 Prolific surveys targeting software engineers.
- Pattern taxonomy – Identifies recurring structural cues (repetitive sequencing, uniform phrasing, shallow personalization) that signal synthetic authorship.
- Validity framework extension – Proposes “data authenticity” as a new dimension of validity for SE surveys, alongside construct, internal, and external validity.
- Hybrid detection workflow – Combines manual qualitative inspection with an automated tool (the Scribbr AI Detector) to flag suspicious answers.
- Guidelines for researchers – Offers concrete recommendations for survey design, reporting, and community standards to mitigate AI‑generated noise.
Methodology
- Survey deployment – Two separate questionnaires were run on the Prolific crowdsourcing platform in early 2025, each gathering several hundred responses from self‑identified software professionals.
- Screening for anomalies – Researchers first looked for outliers (e.g., unusually fast completion times, identical answer strings) and then performed a deeper qualitative read‑through of suspect submissions; a sketch of this screening‑and‑cross‑checking pipeline follows this list.
- Pattern analysis – The team catalogued linguistic and structural traits that repeatedly appeared in the flagged answers (e.g., “In my experience, …” followed by generic statements).
- Automated detection – All responses were fed into the Scribbr AI Detector, a classifier trained to distinguish human‑written text from LLM‑generated text. The detector’s confidence scores were cross‑checked with the manual findings.
- Validity assessment – The impact of the identified AI responses on the survey’s construct, internal, and external validity was evaluated, leading to the proposal of “data authenticity” as an additional validity lens.
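Taken together, the screening, pattern, and detection steps form a small hybrid pipeline. The Python sketch below is a minimal illustration under stated assumptions, not the authors' code: the CSV export, its column names, the two-phrase cue list, and the reuse of the 0.85 confidence threshold from the Results section are all assumed for illustration.

```python
import pandas as pd

# Hypothetical CSV export of the survey data; the file name and columns
# (respondent_id, answer, seconds_taken, detector_score) are assumptions.
# detector_score is the AI detector's confidence that the text is synthetic.
df = pd.read_csv("survey_responses.csv")

# 1. Outlier screening: unusually fast completions and identical answer strings.
fast_cutoff = df["seconds_taken"].quantile(0.05)   # bottom 5% of completion times
df["too_fast"] = df["seconds_taken"] < fast_cutoff
df["duplicate_answer"] = df.duplicated("answer", keep=False)

# 2. Pattern cues: generic openers and filler phrases of the kind the paper catalogues.
cues = ["in my experience", "as far as i know"]
df["has_cue"] = df["answer"].str.lower().str.contains("|".join(cues), regex=True)

# 3. Hybrid cross-check: heuristic hits that the detector also flags with high
#    confidence (the 0.85 threshold mirrors the figure reported in the Results).
heuristic_hit = df["too_fast"] | df["duplicate_answer"] | df["has_cue"]
detector_hit = df["detector_score"] > 0.85
df["suspect"] = heuristic_hit & detector_hit

# Disagreements between the two signals are the borderline cases that the
# paper routes to human judgment.
df["needs_review"] = heuristic_hit ^ detector_hit

print(df.loc[df["suspect"], ["respondent_id", "answer"]])
```

Flagging only the intersection of heuristic and detector signals keeps false positives down, at the cost of sending every disagreement to manual review, which matches the paper's hybrid manual-plus-automated workflow.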
Results & Findings
- 49 out of ~800 responses (≈6%) showed strong evidence of AI generation.
- Structural signatures such as perfectly parallel sentence structures, repeated use of filler phrases (“as far as I know”), and lack of concrete personal anecdotes were the most reliable human‑detectable cues.
- The Scribbr AI Detector flagged 92% of the manually identified AI responses with confidence > 0.85, while also surfacing a few borderline cases that required human judgment (see the recall sketch after this list).
- The presence of AI‑generated answers degraded construct validity (the measured constructs no longer reflected true practitioner beliefs) and threatened internal validity (spurious correlations could be introduced).
- The authors argue that data authenticity—the guarantee that each datum originates from a genuine human respondent—must now be treated as a first‑class validity concern.
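The 92% figure above is, in effect, the detector's recall against the manual labels. A minimal sketch of how such a cross-check can be computed, on synthetic toy data rather than the paper's responses or scores:

```python
# Recall of the automated detector against the manual labels.
# Synthetic toy data; these are not the paper's responses or scores.
manual_ai = {"r03", "r17", "r42"}                 # manually judged AI-written
detector_scores = {"r03": 0.97, "r17": 0.91, "r42": 0.55, "r88": 0.89}

THRESHOLD = 0.85
detector_ai = {rid for rid, s in detector_scores.items() if s > THRESHOLD}

recall = len(manual_ai & detector_ai) / len(manual_ai)
print(f"detector recall vs. manual labels: {recall:.0%}")  # 67% on this toy data
# Responses flagged by only one method ("r42", "r88" here) are the borderline
# cases that require human judgment.
```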
Practical Implications
- Survey designers should embed “human‑verification” steps, such as open‑ended prompts that require personal context (e.g., “Describe a recent bug you fixed”) and time‑based checks to discourage rapid, AI‑driven completion; a sketch of such a check appears after this list.
- Tool builders can integrate AI‑detectors directly into survey platforms (Qualtrics, Google Forms, etc.) to provide real‑time alerts for suspicious submissions.
- Researchers need to disclose detection methods and authenticity metrics in their publications, fostering transparency and reproducibility.
- Industry practitioners who rely on survey‑based benchmarks (e.g., developer productivity tools, CI/CD adoption rates) should treat published results with a healthy dose of skepticism until authenticity safeguards become standard.
- Community standards (e.g., ACM SIGSOFT, IEEE) may soon require a “data authenticity statement” as part of conference paper submissions involving surveys.
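As a concrete instance of the time-based checks suggested in the first item above, a survey backend could reject answers typed faster than any plausible reading-plus-writing speed. This is a hypothetical sketch, not any platform's actual API; the 4-words-per-second bound is an assumption.

```python
# Hypothetical plausibility check for a single open-ended answer; it assumes
# the survey backend records per-question timing and is not tied to any
# specific platform's API.

WORDS_PER_SECOND = 4.0  # generous upper bound for reading plus typing speed

def plausibly_human(question: str, answer: str, seconds_taken: float) -> bool:
    """Reject answers produced faster than a human could plausibly read and type."""
    words_involved = len(question.split()) + len(answer.split())
    return seconds_taken >= words_involved / WORDS_PER_SECOND

# A 60-word answer to a 6-word prompt submitted in 8 seconds is almost
# certainly pasted rather than typed.
print(plausibly_human("Describe a recent bug you fixed.", "word " * 60, 8.0))  # False
```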
Limitations & Future Work
- The study focused on a single crowdsourcing platform (Prolific) and two surveys; results may differ on other recruitment channels (e.g., GitHub, Stack Overflow).
- Detection relied on a proprietary AI detector (Scribbr); its performance on newer LLMs (e.g., GPT‑4‑Turbo, Claude 3) remains untested.
- The authors acknowledge a false‑positive risk—some genuine respondents may write in a concise, formulaic style that mimics AI output.
- Future research directions include building open‑source, domain‑specific AI detectors for SE, exploring adversarial prompting techniques that LLMs might use to evade detection, and conducting longitudinal studies to track how AI misuse evolves as LLMs become more accessible.
Authors
- Ronnie de Souza Santos
- Italo Santos
- Maria Teresa Baldassarre
- Cleyton Magalhaes
- Mairieli Wessel
Paper Information
- arXiv ID: 2512.17455v1
- Categories: cs.SE
- Published: December 19, 2025
- PDF: https://arxiv.org/pdf/2512.17455v1