[Paper] FailureMem: A Failure-Aware Multimodal Framework for Autonomous Software Repair

Published: March 18, 2026
4 min read
Source: arXiv - 2603.17826v1

Overview

The paper introduces FailureMem, a new multimodal framework for automated software repair that can understand code, textual bug reports, and GUI screenshots all at once. By letting the system learn from its own past failures, FailureMem pushes the state‑of‑the‑art in “debug‑as‑you‑code” tools and shows measurable gains over existing approaches.

Key Contributions

  • Hybrid workflow‑agent architecture – combines a structured pipeline for locating the buggy UI region with a flexible LLM‑driven reasoning agent that can explore alternative fixes.
  • Active perception & region‑level visual grounding – instead of processing whole‑page screenshots, the system isolates relevant UI components (buttons, dialogs, etc.) to focus its visual reasoning.
  • Failure Memory Bank – automatically records unsuccessful repair attempts, extracts actionable patterns, and re‑uses them as guidance for future bugs.
  • Empirical improvement – on the SWE‑bench Multimodal benchmark, FailureMem raises the bug‑resolution rate by 3.7 percentage points (62.3 % → 66.0 %) over the prior best system (GUIRepair).

Methodology

  1. Input Fusion – The framework ingests three modalities: (a) the source code of the affected component, (b) a natural‑language issue description, and (c) a screenshot of the running UI.
  2. Hybrid Pipeline
    • Localization stage: a lightweight visual detector scans the screenshot and proposes candidate UI regions (e.g., a mis‑aligned button).
    • Reasoning stage: a large language model (LLM) receives the localized region, the relevant code snippets, and the textual bug report. It generates candidate patches, which are then compiled and tested.
  3. Active Perception Loop – If the generated patch fails, the system can request a new visual focus (e.g., zoom into a different widget) without restarting the whole pipeline.
  4. Failure Memory Bank – Each failed attempt is logged with its context (code, description, visual region) and the reason for failure (e.g., compilation error, test failure). A similarity matcher retrieves past failures that match the current bug, providing the LLM with “what not to do” hints.
  5. Iterative Repair – The LLM iterates, using both fresh reasoning and the Failure Memory guidance, until a patch passes all tests or a timeout is reached.
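The memory‑and‑retry loop in steps 4–5 can be sketched in a few lines of Python. Everything here is an illustrative assumption, not the paper's actual implementation: the class names, the token‑overlap (Jaccard) similarity matcher, and the loop structure are stand‑ins for whatever the authors use internally.

```python
from dataclasses import dataclass

@dataclass
class FailureRecord:
    description: str  # textual bug report
    region: str       # localized UI region (e.g., "dialog", "button")
    reason: str       # why the attempt failed (compile error, test failure, ...)

class FailureMemoryBank:
    """Toy memory bank: logs failed repair attempts and retrieves the most
    similar past failures via Jaccard similarity over description tokens."""
    def __init__(self):
        self.records: list[FailureRecord] = []

    def log(self, record: FailureRecord) -> None:
        self.records.append(record)

    def retrieve(self, description: str, k: int = 3) -> list[FailureRecord]:
        query = set(description.lower().split())
        def score(r: FailureRecord) -> float:
            toks = set(r.description.lower().split())
            union = query | toks
            return len(query & toks) / len(union) if union else 0.0
        return sorted(self.records, key=score, reverse=True)[:k]

def repair_loop(bug_description, region, generate_patch, passes_tests,
                bank, max_iters=5):
    """Iterate: retrieve 'what not to do' hints, generate a patch, test it,
    and log each failure back into the bank for future retrieval."""
    for _ in range(max_iters):
        hints = bank.retrieve(bug_description)
        patch = generate_patch(bug_description, region, hints)
        if passes_tests(patch):
            return patch
        bank.log(FailureRecord(bug_description, region, reason="test failure"))
    return None  # timeout: no passing patch found
```

In the real system, `generate_patch` would be the LLM reasoning stage and `passes_tests` the compile‑and‑test harness; the key design point the sketch captures is that failures are written back into the bank, so later retrievals prune already‑explored dead ends.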

Results & Findings

| Metric | GUIRepair (baseline) | FailureMem |
| --- | --- | --- |
| Resolved bugs (SWE‑bench Multimodal) | 62.3 % | 66.0 % |
| Average repair iterations | 4.2 | 3.1 |
| Time to first successful patch (s) | 28.7 | 22.5 |
  • The 3.7‑point absolute lift translates to ≈ 60 additional bugs fixed in the benchmark suite.
  • FailureMem needed ≈ 26 % fewer iterations on average (4.2 → 3.1), showing that the Failure Memory Bank effectively prunes unproductive search paths.
  • Region‑level visual grounding reduced irrelevant visual noise, leading to clearer prompts for the LLM and faster convergence.

Practical Implications

  • Developer tooling – Integrated into IDEs, FailureMem could suggest UI‑related fixes on the fly, automatically surfacing the exact widget that needs attention.
  • Continuous integration pipelines – Teams can plug the framework into CI to auto‑repair UI regressions before they reach production, cutting down on manual triage.
  • Knowledge retention – The Failure Memory Bank acts like an institutional memory, turning each failed fix into a reusable lesson, which is especially valuable for large, fast‑moving codebases.
  • Cross‑modal debugging – By jointly reasoning over code, text, and screenshots, the system bridges the gap between front‑end designers and back‑end engineers, fostering smoother collaboration.

Limitations & Future Work

  • Scalability of visual detection – The current region detector works well on desktop‑style GUIs but may struggle with highly dynamic or mobile layouts.
  • Memory bank size management – As the number of logged failures grows, efficient indexing and pruning strategies will be needed to keep retrieval fast.
  • Generalization beyond UI bugs – The framework is tuned for GUI‑related defects; extending it to non‑visual bugs (e.g., performance regressions) remains an open challenge.
  • User study – The paper evaluates on benchmark data; real‑world developer studies are planned to assess usability and trust in the suggested patches.

FailureMem demonstrates that letting automated repair systems learn from their own mistakes—and from the visual context of the software they’re fixing—can meaningfully boost both efficiency and success rates, paving the way for smarter, multimodal debugging assistants.

Authors

  • Ruize Ma
  • Yilei Jiang
  • Shilin Zhang
  • Zheng Ma
  • Yi Feng
  • Vincent Ng
  • Zhi Wang
  • Xiangyu Yue
  • Chuanyi Li
  • Lewei Lu

Paper Information

  • arXiv ID: 2603.17826v1
  • Categories: cs.SE, cs.AI
  • Published: March 18, 2026