[Paper] Multi-Agent Systems for Dataset Adaptation in Software Engineering: Capabilities, Limitations, and Future Directions

Published: November 26, 2025 at 08:26 AM EST
4 min read
Source: arXiv - 2511.21380v1

Overview

The paper presents the first systematic evaluation of modern multi‑agent large language model (LLM) systems—specifically GitHub Copilot powered by GPT‑4.1 and Claude Sonnet 4—for the dataset adaptation problem in software‑engineering (SE) research. By automating the migration of research artifacts (scripts, pipelines, configuration files) from one benchmark repository to another, the authors explore how far current AI agents can go toward making SE experiments more reproducible and scalable.

Key Contributions

  • Empirical benchmark: Introduces a five‑stage evaluation pipeline (file comprehension → code editing → command generation → validation → execution) to rigorously assess multi‑agent performance on real SE datasets (ROCODE, LogHub 2.0).
  • Quantitative baseline: Shows that out‑of‑the‑box agents achieve a modest 7.25 % structural similarity to the ground‑truth adaptations, with functional correctness well below 5 %.
  • Prompt‑engineering interventions: Demonstrates that supplying agents with execution error messages and reference snippets boosts similarity to 67.14 %, revealing the power of feedback‑driven prompting (a minimal prompt sketch follows this list).
  • Failure taxonomy: Categorizes common error patterns (missing imports, incorrect file paths, API misuse) that help pinpoint where agents stumble.
  • Roadmap for self‑correcting agents: Proposes concrete design directions—iterative self‑debugging loops, richer tool‑calling APIs, and dataset‑aware prompting—to close the gap between current capabilities and fully autonomous adaptation.
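
To make the feedback‑driven prompting idea concrete, here is a minimal sketch of the three intervention levels evaluated in the paper. The function name, prompt wording, file paths, and snippet contents are illustrative assumptions, not the authors' actual prompts.

```python
def build_prompt(task_description: str,
                 error_message: str | None = None,
                 reference_snippet: str | None = None) -> str:
    """Compose an adaptation prompt at one of the three intervention levels."""
    parts = [f"Adapt the following research artifact to the target dataset:\n{task_description}"]
    if error_message is not None:
        # Level (b): feed back the execution error captured during validation.
        parts.append(f"The previous attempt failed with:\n{error_message}")
    if reference_snippet is not None:
        # Level (c): additionally supply a reference snippet from the target repository.
        parts.append(f"Reference code from the target dataset:\n{reference_snippet}")
    return "\n\n".join(parts)


# Example of a level (c) prompt; the error text and snippet are made up.
prompt = build_prompt(
    "Port the CSV-based log loader to the target repository's raw log format.",
    error_message="FileNotFoundError: [Errno 2] No such file or directory: 'logs.csv'",
    reference_snippet="events = [line.split(maxsplit=5)[-1] for line in open('HDFS.log')]",
)
```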

Methodology

  1. Dataset selection – The authors chose two widely used SE benchmark suites:

    • ROCODE (code‑smell detection)
    • LogHub 2.0 (log‑analysis pipelines)
  2. Task definition – For each repository, a set of “adaptation tasks” was crafted (e.g., port a Python script that reads CSV logs to a new log format); a toy illustration of such a task follows this list.

  3. Agent configuration – Copilot was run in agent mode with two underlying LLMs: GPT‑4.1 and Claude Sonnet 4. The agents could invoke a limited toolbox (file I/O, shell commands, simple Python REPL).

  4. Five‑stage pipeline (a driver‑loop sketch follows this list)

    • File comprehension: Agent identifies relevant source files.
    • Code editing: Generates modifications to meet the target dataset’s schema.
    • Command generation: Produces build/run commands.
    • Validation: Executes the commands in a sandbox, captures errors.
    • Final execution: Checks whether the adapted artifact runs successfully and produces expected outputs.
  5. Prompt interventions – Three levels of prompting were tested:
    (a) baseline prompt,
    (b) baseline + error‑message feedback,
    (c) baseline + error + reference code snippet.

  6. Metrics – Structural similarity (AST‑based diff), functional correctness (pass/fail of unit tests), and success rate per pipeline stage; a rough sketch of the similarity metric also follows this list.
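
As a toy illustration of step 2, consider adapting a loader that expects CSV logs to a dataset that ships raw, space‑delimited log lines. The file layout, column names, and field positions below are hypothetical, chosen only to convey the flavor of the adaptation tasks.

```python
import csv

# Before adaptation: the source pipeline reads structured CSV logs.
def load_messages_csv(path: str) -> list[str]:
    with open(path, newline="") as f:
        return [row["message"] for row in csv.DictReader(f)]

# After adaptation: the target dataset provides raw, space-delimited log lines,
# so the same downstream pipeline needs a different loader. We assume here that
# the free-text message starts at the sixth whitespace-separated field.
def load_messages_raw(path: str) -> list[str]:
    with open(path) as f:
        return [line.split(maxsplit=5)[-1].strip() for line in f if line.strip()]
```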
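
The five‑stage pipeline of step 4 can be pictured as a short driver loop. The sketch below assumes a hypothetical `agent` object exposing a single `ask()` method for LLM calls; the real harness, tool‑calling interface, and sandboxing are more involved than shown here.

```python
import subprocess

def adapt_artifact(agent, task: str, source_repo: str, target_repo: str, max_rounds: int = 3):
    # Stage 1: file comprehension – have the agent identify the relevant sources.
    files = agent.ask(f"List the files in {source_repo} needed for this task: {task}")

    feedback = ""
    for _ in range(max_rounds):
        # Stage 2: code editing – generate the adapted sources.
        patch = agent.ask(f"Adapt {files} to the schema of {target_repo}. {feedback}")

        # Stage 3: command generation – ask for a build/run command.
        command = agent.ask(f"Give one shell command to run the adapted artifact:\n{patch}")

        # Stage 4: validation – execute in a sandbox and capture any error output.
        result = subprocess.run(command, shell=True, capture_output=True, text=True)

        # Stage 5: final execution check – on success, hand back the output for
        # comparison against the expected results; otherwise loop with feedback.
        if result.returncode == 0:
            return result.stdout
        feedback = f"The previous attempt failed with:\n{result.stderr}"

    return None  # the adaptation did not converge within the round budget
```

The error string carried into the next round is exactly the intervention that lifted structural similarity from 7.25 % to 42.8 % in the paper's experiments.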
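
Finally, the structural‑similarity metric of step 6 can be approximated with Python's standard library. This is only a rough stand‑in, assuming the paper diffs syntax trees; the authors' exact normalization and scoring almost certainly differ.

```python
import ast
import difflib

def structural_similarity(candidate_src: str, reference_src: str) -> float:
    """Return a 0-1 similarity ratio between the AST node sequences of two Python files."""
    cand_nodes = [type(n).__name__ for n in ast.walk(ast.parse(candidate_src))]
    ref_nodes = [type(n).__name__ for n in ast.walk(ast.parse(reference_src))]
    return difflib.SequenceMatcher(None, cand_nodes, ref_nodes).ratio()

# Two snippets with identical structure but different identifiers score 1.0.
print(structural_similarity("x = 1\nprint(x)", "y = 2\nprint(y)"))
```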

Results & Findings

| Prompt condition | Structural similarity ↑ | Functional correctness ↑ |
|------------------|-------------------------|--------------------------|
| Baseline         | 7.25 %                  | 3.1 %                    |
| + Error feedback | 42.8 %                  | 15.6 %                   |
| + Reference code | 67.14 %                 | 31.2 %                   |

  • Stage bottlenecks: The validation stage caused >60 % of failures; agents often generated syntactically correct code that crashed due to missing dependencies.
  • Error‑driven prompting proved far more effective than simply adding more description to the initial request.
  • Model differences: GPT‑4.1 slightly outperformed Claude Sonnet 4 on code‑editing, but Claude was better at generating correct shell commands.
  • Failure taxonomy highlighted three recurring themes: (1) environment mismatch, (2) API version drift, and (3) implicit assumptions about data layout.

Practical Implications

  • Accelerated reproducibility – Teams can use a guided multi‑agent workflow to migrate existing research pipelines to new datasets without manually rewriting every script, cutting onboarding time from days to hours.
  • CI/CD integration – The five‑stage pipeline maps naturally onto continuous‑integration pipelines; agents can act as “smart bots” that automatically adjust test suites when dataset schemas evolve.
  • Tooling extensions – IDE plugins could expose the “error‑feedback loop” so developers get on‑the‑fly suggestions when a script fails after a dataset change.
  • Cost‑effective scaling – For large‑scale empirical SE studies (e.g., mining thousands of repositories), automated adaptation reduces the human labor needed to keep data collection scripts up‑to‑date.

Limitations & Future Work

  • Scope of datasets – The study focuses on two SE benchmarks; results may differ for domains with more complex build systems (e.g., C/C++ projects).
  • Sandbox constraints – Agents operated in a limited toolset; richer environments (Docker, package managers) could improve success rates.
  • Self‑correction – Current agents rely on external prompts for error feedback; building truly autonomous self‑debugging loops remains an open challenge.
  • Evaluation depth – Functional correctness was measured via simple unit tests; deeper semantic validation (e.g., statistical equivalence of model outputs) is left for future research.

Bottom line: Multi‑agent LLMs are already capable of recognizing and partially adapting SE research artifacts, but unlocking reliable, end‑to‑end automation will require tighter integration of feedback mechanisms, richer tooling, and domain‑specific prompting strategies. Developers interested in reproducible SE pipelines should start experimenting with the error‑feedback prompting pattern today, while keeping an eye on the next generation of self‑correcting AI agents.

Authors

  • Jingyi Chen
  • Xiaoyan Guo
  • Songqiang Chen
  • Shing-Chi Cheung
  • Jiasi Shen

Paper Information

  • arXiv ID: 2511.21380v1
  • Categories: cs.SE
  • Published: November 26, 2025