[Paper] Automating Computational Reproducibility in Social Science: Comparing Prompt-Based and Agent-Based Approaches
Source: arXiv - 2602.08561v1
Overview
Reproducing computational research in the social sciences is often far more painful than simply re‑running a script—missing packages, broken file paths, and hidden assumptions can derail the process. This paper investigates whether large language models (LLMs) and autonomous AI agents can automatically diagnose and fix such failures, turning a tedious manual debugging session into a largely hands‑off workflow.
Key Contributions
- Controlled reproducibility testbed built from five fully reproducible R‑based social‑science studies, with realistic failure injection covering a spectrum from trivial missing libraries to complex logical gaps.
- Two automated repair pipelines:
- Prompt‑based workflow that iteratively queries an LLM with structured prompts of varying context.
- Agent‑based workflow where an autonomous “AI programmer” inspects the repository, edits code, and re‑executes the analysis inside a clean Docker container.
- Empirical comparison of the two approaches across 15 failure scenarios, revealing that agent‑based systems achieve 69‑96 % success while prompt‑only methods range from 31‑79 %, with performance heavily tied to prompt richness and error complexity.
- Open‑source release of the testbed, failure injection scripts, and the automation pipelines, enabling the community to benchmark future reproducibility tools.
Methodology
- Testbed Construction – The authors selected five published R projects that originally reproduced perfectly. Each project was containerized (Docker) to guarantee a clean baseline environment.
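The baseline step can be sketched as a small helper that builds the containerized invocation. The image name, mount point, and entry-script name below are illustrative assumptions; the paper does not publish its exact Docker invocation.

```python
def docker_run_cmd(image: str, project_dir: str,
                   entry_script: str = "analysis.R") -> list[str]:
    """Build the `docker run` command for one clean baseline execution:
    mount the project into the container, run its R entry point, exit."""
    return [
        "docker", "run", "--rm",
        "-v", f"{project_dir}:/workspace",  # project mounted read-write
        "-w", "/workspace",                 # run from the project root
        image,
        "Rscript", entry_script,
    ]
```

Executing the baseline is then a matter of passing the list to `subprocess.run`, e.g. with a pinned image such as `rocker/r-ver:4.3.1` (the specific tag is an assumption).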
- Failure Injection – For each project, three categories of faults were introduced:
- Simple: missing CRAN packages or broken file paths.
- Intermediate: version mismatches and deprecated functions.
- Complex: omitted data‑preprocessing steps or hidden business logic.
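Faults in these categories can be injected mechanically. A minimal sketch for two "simple" faults and one "complex" one, with hypothetical package, path, and marker names (the authors' actual injection scripts are part of the released testbed):

```python
import re

def inject_missing_package(r_code: str, pkg: str) -> str:
    """Simple: comment out the library() call so the script later fails
    with 'could not find function' on first use of the package."""
    return re.sub(rf'^library\({pkg}\)\s*$',
                  f'# library({pkg})  # injected failure',
                  r_code, count=1, flags=re.M)

def inject_broken_path(r_code: str, old: str, new: str) -> str:
    """Simple: point a file read at a path that does not exist."""
    return r_code.replace(old, new, 1)

def inject_dropped_step(r_code: str, marker: str) -> str:
    """Complex: delete a preprocessing line tagged with a marker comment."""
    return re.sub(rf'^.*{re.escape(marker)}.*$\n?', '', r_code, flags=re.M)
```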
- Prompt‑Based Pipeline – A large language model (e.g., GPT‑4) receives a series of prompts that include the error traceback, a snippet of the failing script, and optionally the full repository tree. The model suggests a fix, which is applied automatically, then the script is rerun. This loop repeats up to three times per failure.
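The suggest-apply-rerun loop described above can be sketched as follows. `ask_llm` (an LLM API wrapper) and `run` (which re-executes the script in a fresh container and returns success plus the traceback) are assumed interfaces, not the authors' exact ones.

```python
from pathlib import Path

MAX_ATTEMPTS = 3  # the paper caps the repair loop at three tries per failure

def prompt_repair(script: Path, ask_llm, run) -> tuple[bool, int]:
    """Iterative prompt-based repair: run, collect the traceback,
    ask the LLM for a corrected script, apply it, and retry."""
    for attempt in range(MAX_ATTEMPTS):
        ok, traceback = run(script)
        if ok:
            return True, attempt
        prompt = (f"The following R script failed.\n\nError:\n{traceback}\n\n"
                  f"Script:\n{script.read_text()}\n\n"
                  "Return the corrected script only.")
        script.write_text(ask_llm(prompt))
    ok, _ = run(script)  # final check after the last suggested fix
    return ok, MAX_ATTEMPTS
```

Richer prompt variants would simply append the repository tree or an analysis description to `prompt`, which is the contextual knob the paper varies.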
- Agent‑Based Pipeline – An autonomous AI agent (implemented with LangChain‑style tooling) can:
- List files, read/write code, install packages, and execute shell commands.
- Maintain an internal state of what it has tried, allowing it to backtrack and explore alternative fixes.
- Operate fully within the Docker container, iterating until the analysis completes without error or a timeout is reached.
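The capabilities above amount to a standard tool-using agent loop. In this sketch, `choose_action` stands in for the LLM policy and the tool set is a minimal assumption, not the paper's implementation:

```python
import subprocess

# Tools mirror the listed capabilities: inspect files, edit code, run commands.
TOOLS = {
    "read":  lambda path: open(path).read(),
    "write": lambda path, text: open(path, "w").write(text),
    "shell": lambda cmd: subprocess.run(cmd, shell=True, capture_output=True,
                                        text=True).stdout,
}

def agent_loop(choose_action, max_steps=20):
    """Run the agent until it declares completion or hits the step budget.
    `history` is the internal state that lets the policy backtrack."""
    history = []
    for _ in range(max_steps):
        action = choose_action(history)   # e.g. {"tool": "read", "args": [...]}
        if action["tool"] == "done":
            return history
        result = TOOLS[action["tool"]](*action["args"])
        history.append((action, result))
    return history
```

In the paper's setup this loop runs inside the Docker container, so package installs and intermediate results persist between steps instead of being rebuilt on every retry.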
- Evaluation – Success is defined as the final run producing output (tables, figures, statistics) identical to the original, reproducible baseline. The authors record success rates, number of LLM calls, and total wall‑clock time.
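One way to operationalize this success criterion is a byte-level comparison of the two output directories. The directory layout is an assumption; the paper compares tables, figures, and statistics against the baseline.

```python
import hashlib
from pathlib import Path

def outputs_match(baseline_dir: str, repaired_dir: str) -> bool:
    """Success check (sketch): every artifact produced by the repaired run
    is byte-identical to the reproducible baseline, with no files missing
    or added on either side."""
    def digests(root: str) -> dict:
        return {p.relative_to(root): hashlib.sha256(p.read_bytes()).hexdigest()
                for p in Path(root).rglob('*') if p.is_file()}
    return digests(baseline_dir) == digests(repaired_dir)
```

Byte-identity is the strictest reading of "exact same output"; a tolerance-based comparison would be needed if any analysis step were non-deterministic.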
Results & Findings
| Failure Complexity | Prompt‑Based Success | Agent‑Based Success |
|---|---|---|
| Simple | 79 % | 96 % |
| Intermediate | 55 % | 84 % |
| Complex | 31 % | 69 % |
- Prompt context matters: Adding the full repository tree and a concise description of the intended analysis boosted prompt‑based success on complex cases from 22 % to 41 %.
- Agent autonomy pays off: The ability to install missing packages, modify import statements, and retry automatically gave agents a clear edge, especially when multiple interdependent fixes were required.
- Efficiency: Prompt‑based runs typically needed fewer LLM calls (average 2.1 per failure) but took longer overall due to repeated container restarts. Agent runs made more calls (average 4.3) but converged faster because they could keep the container alive and cache intermediate results.
- Error diagnostics: Both approaches correctly identified the root cause in >90 % of successful cases, suggesting that LLMs have a solid understanding of R error messages and package ecosystems.
Practical Implications
- Developer tooling – IDE plugins or CI/CD steps could embed an “auto‑repair” mode that triggers an LLM or agent when a build fails due to missing dependencies or path errors, dramatically reducing onboarding friction for new contributors.
- Research reproducibility platforms – Services like Open Science Framework or reproducibility badges could automatically run an agent‑based validator on submitted code, flagging fixable issues before human review.
- Education – Teaching computational social science can incorporate these agents as “virtual lab assistants,” allowing students to focus on methodological questions rather than environment setup.
- Enterprise data pipelines – Organizations that rely on legacy R scripts for analytics can deploy an agent to continuously monitor and self‑heal broken pipelines caused by library updates or server migrations.
Limitations & Future Work
- Domain specificity – The study only examined R projects in social science; performance on Python, Julia, or mixed‑language stacks remains unknown.
- Error scope – Injected failures were synthetically generated; real‑world bugs may involve subtle statistical mis‑specifications that are harder for LLMs to detect.
- Resource cost – Agent‑based runs consume more API calls and compute, which could be prohibitive at large scale without optimized prompting or on‑premise LLMs.
- Explainability – While agents can fix code, they do not currently produce human‑readable rationales for each change, a gap that future work should address to build trust among researchers.
Bottom line: Automated, AI‑driven repair pipelines—especially those that give the system agency to explore and edit code—show strong promise for turning reproducibility from a manual chore into a scalable service. As LLMs become more capable and tooling matures, we can expect these approaches to become standard components of the research software stack.
Authors
- Syed Mehtab Hussain Shah
- Frank Hopfgartner
- Arnim Bleier
Paper Information
- arXiv ID: 2602.08561v1
- Categories: cs.SE, cs.CL
- Published: February 9, 2026