[Paper] FailureMem: A Failure-Aware Multimodal Framework for Autonomous Software Repair

Published: March 18, 2026
4 min read
Source: arXiv - 2603.17826v1

Overview

The paper introduces FailureMem, a new multimodal framework for automated software repair that can understand code, textual bug reports, and GUI screenshots all at once. By letting the system learn from its own past failures, FailureMem pushes the state‑of‑the‑art in “debug‑as‑you‑code” tools and shows measurable gains over existing approaches.

Key Contributions

  • Hybrid workflow‑agent architecture – combines a structured pipeline for locating the buggy UI region with a flexible LLM‑driven reasoning agent that can explore alternative fixes.
  • Active perception & region‑level visual grounding – instead of processing whole‑page screenshots, the system isolates relevant UI components (buttons, dialogs, etc.) to focus its visual reasoning.
  • Failure Memory Bank – automatically records unsuccessful repair attempts, extracts actionable patterns, and re‑uses them as guidance for future bugs.
  • Empirical improvement – on the SWE‑bench Multimodal benchmark, FailureMem raises the bug‑resolution rate by 3.7 percentage points (62.3 % → 66.0 %) over the prior best system (GUIRepair).

Methodology

  1. Input Fusion – The framework ingests three modalities: (a) the source code of the affected component, (b) a natural‑language issue description, and (c) a screenshot of the running UI.
  2. Hybrid Pipeline
    • Localization stage: a lightweight visual detector scans the screenshot and proposes candidate UI regions (e.g., a mis‑aligned button).
    • Reasoning stage: a large language model (LLM) receives the localized region, the relevant code snippets, and the textual bug report. It generates candidate patches, which are then compiled and tested.
  3. Active Perception Loop – If the generated patch fails, the system can request a new visual focus (e.g., zoom into a different widget) without restarting the whole pipeline.
  4. Failure Memory Bank – Each failed attempt is logged with its context (code, description, visual region) and the reason for failure (e.g., compilation error, test failure). A similarity matcher retrieves past failures that match the current bug, providing the LLM with “what not to do” hints.
  5. Iterative Repair – The LLM iterates, using both fresh reasoning and the Failure Memory guidance, until a patch passes all tests or a timeout is reached.
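The memory‑and‑retry loop in steps 4–5 can be sketched in a few lines of Python. Everything here is an illustrative assumption, not the paper's actual implementation: the class names, the token‑overlap (Jaccard) similarity matcher, and the loop structure are stand‑ins for whatever the authors use internally.

```python
from dataclasses import dataclass

@dataclass
class FailureRecord:
    description: str  # textual bug report
    region: str       # localized UI region (e.g., "dialog", "button")
    reason: str       # why the attempt failed (compile error, test failure, ...)

class FailureMemoryBank:
    """Toy memory bank: logs failed repair attempts and retrieves the most
    similar past failures via Jaccard similarity over description tokens."""
    def __init__(self):
        self.records: list[FailureRecord] = []

    def log(self, record: FailureRecord) -> None:
        self.records.append(record)

    def retrieve(self, description: str, k: int = 3) -> list[FailureRecord]:
        query = set(description.lower().split())
        def score(r: FailureRecord) -> float:
            toks = set(r.description.lower().split())
            union = query | toks
            return len(query & toks) / len(union) if union else 0.0
        return sorted(self.records, key=score, reverse=True)[:k]

def repair_loop(bug_description, region, generate_patch, passes_tests,
                bank, max_iters=5):
    """Iterate: retrieve 'what not to do' hints, generate a patch, test it,
    and log each failure back into the bank for future retrieval."""
    for _ in range(max_iters):
        hints = bank.retrieve(bug_description)
        patch = generate_patch(bug_description, region, hints)
        if passes_tests(patch):
            return patch
        bank.log(FailureRecord(bug_description, region, reason="test failure"))
    return None  # timeout: no passing patch found
```

In the real system, `generate_patch` would be the LLM reasoning stage and `passes_tests` the compile‑and‑test harness; the key design point the sketch captures is that failures are written back into the bank, so later retrievals prune already‑explored dead ends.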

Results & Findings

| Metric | GUIRepair (baseline) | FailureMem |
| --- | --- | --- |
| Resolved bugs (SWE‑bench Multimodal) | 62.3 % | 66.0 % |
| Average repair iterations | 4.2 | 3.1 |
| Time to first successful patch (s) | 28.7 | 22.5 |
  • The 3.7‑point absolute lift translates to ≈ 60 additional bugs fixed in the benchmark suite.
  • FailureMem needed ≈ 26 % fewer iterations on average (4.2 → 3.1), showing that the Failure Memory Bank effectively prunes unproductive search paths.
  • Region‑level visual grounding reduced irrelevant visual noise, leading to clearer prompts for the LLM and faster convergence.

Practical Implications

  • Developer tooling – Integrated into IDEs, FailureMem could suggest UI‑related fixes on the fly, automatically surfacing the exact widget that needs attention.
  • Continuous integration pipelines – Teams can plug the framework into CI to auto‑repair UI regressions before they reach production, cutting down on manual triage.
  • Knowledge retention – The Failure Memory Bank acts like an institutional memory, turning each failed fix into a reusable lesson, which is especially valuable for large, fast‑moving codebases.
  • Cross‑modal debugging – By jointly reasoning over code, text, and screenshots, the system bridges the gap between front‑end designers and back‑end engineers, fostering smoother collaboration.

Limitations & Future Work

  • Scalability of visual detection – The current region detector works well on desktop‑style GUIs but may struggle with highly dynamic or mobile layouts.
  • Memory bank size management – As the number of logged failures grows, efficient indexing and pruning strategies will be needed to keep retrieval fast.
  • Generalization beyond UI bugs – The framework is tuned for GUI‑related defects; extending it to non‑visual bugs (e.g., performance regressions) remains an open challenge.
  • User study – The paper evaluates on benchmark data; real‑world developer studies are planned to assess usability and trust in the suggested patches.

FailureMem demonstrates that letting automated repair systems learn from their own mistakes—and from the visual context of the software they’re fixing—can meaningfully boost both efficiency and success rates, paving the way for smarter, multimodal debugging assistants.

Authors

  • Ruize Ma
  • Yilei Jiang
  • Shilin Zhang
  • Zheng Ma
  • Yi Feng
  • Vincent Ng
  • Zhi Wang
  • Xiangyu Yue
  • Chuanyi Li
  • Lewei Lu

Paper Information

  • arXiv ID: 2603.17826v1
  • Categories: cs.SE, cs.AI
  • Published: March 18, 2026