[Paper] BAMI: Training-Free Bias Mitigation in GUI Grounding

Published: May 7, 2026 at 01:59 PM EDT
Source: arXiv - 2605.06664v1

Overview

The paper introduces BAMI, a training‑free technique that targets two hidden sources of error in GUI grounding: precision bias from high‑resolution screenshots and ambiguity bias from crowded UI elements. Wrapping BAMI around existing GUI‑grounding models improves accuracy on challenging benchmarks such as ScreenSpot‑Pro without retraining any of them.

Key Contributions

  • Bias diagnosis with Masked Prediction Distribution (MPD): a novel attribution tool that pinpoints precision and ambiguity biases in GUI grounding pipelines.
  • Bias‑Aware Manipulation Inference (BAMI): a lightweight, inference‑only framework that applies two manipulations—coarse‑to‑fine focus and candidate selection—to counteract the identified biases.
  • Training‑free performance gains: demonstrated across multiple state‑of‑the‑art models (e.g., TianXi‑Action‑7B), with accuracy gains of up to about 6 percentage points on the ScreenSpot‑Pro benchmark.
  • Robustness validated by extensive ablations: showing stable improvements across a wide range of hyper‑parameter settings.
  • Open‑source implementation: the authors release code and scripts, making it easy for practitioners to adopt BAMI in their pipelines.
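The masking-based diagnosis named in the first bullet can be illustrated with a toy sketch. This is not the authors' MPD implementation: the function names (`mask_patch`, `masked_prediction_distribution`, `toy_predict`), the square zero-filled masks, and the brightest-pixel "model" are all assumptions made for illustration. The only idea taken from the post is that patches of the screenshot are masked and the resulting predictions are tallied.

```python
import random
from collections import Counter


def mask_patch(img, x0, y0, size, fill=0):
    """Return a copy of img with a size x size square set to `fill`."""
    out = [row[:] for row in img]
    for y in range(y0, min(y0 + size, len(img))):
        for x in range(x0, min(x0 + size, len(img[0]))):
            out[y][x] = fill
    return out


def masked_prediction_distribution(predict, img, instruction,
                                   n_trials=50, size=8, seed=0):
    """Tally predicted points over randomly masked copies of the screenshot.

    `predict(img, instruction) -> (x, y)` is a hypothetical model interface.
    """
    rng = random.Random(seed)
    h, w = len(img), len(img[0])
    counts = Counter()
    for _ in range(n_trials):
        x0 = rng.randrange(0, w - size + 1)
        y0 = rng.randrange(0, h - size + 1)
        counts[predict(mask_patch(img, x0, y0, size), instruction)] += 1
    return counts


def toy_predict(img, instruction):
    """Stand-in model (assumption): clicks the brightest pixel."""
    best = max(((v, x, y) for y, row in enumerate(img)
                for x, v in enumerate(row)), key=lambda t: t[0])
    return (best[1], best[2])


def make_screen():
    """32x32 toy screen: true target at (20, 10), look-alike at (3, 27)."""
    img = [[0] * 32 for _ in range(32)]
    img[10][20] = 255
    img[27][3] = 210
    return img


if __name__ == "__main__":
    dist = masked_prediction_distribution(toy_predict, make_screen(), "click the target")
    for point, n in dist.most_common():
        print(point, n)
```

In this toy setting, predictions that jump to the look-alike whenever the target is masked hint at an ambiguity-style failure. How the paper turns the distribution into separate precision and ambiguity diagnoses is not described in this summary, so the sketch stops at the tally.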

Methodology

  1. Detecting bias with MPD – The authors mask random patches of a GUI screenshot and observe how the model’s prediction distribution changes. Large shifts reveal where the model is overly sensitive (precision bias) or confused (ambiguity bias).
  2. Coarse‑to‑fine focus – Instead of feeding the full‑resolution image directly, BAMI first runs the model on a down‑sampled (coarse) version to locate the general region of interest, then refines the prediction on a high‑resolution crop of that region. This reduces the precision bias caused by unnecessary pixel‑level detail.
  3. Candidate selection – For UI elements that look similar (e.g., multiple buttons with the same icon), BAMI generates a short list of plausible candidates from the coarse pass and re‑ranks them using a lightweight similarity score that incorporates textual cues (labels, tooltips). This mitigates ambiguity bias without any extra training.
  4. Inference‑only pipeline – All steps are performed at test time; no gradients are computed, and no model weights are altered. The approach can be wrapped around any existing GUI‑grounding model as a drop‑in post‑processor.
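Steps 2 to 4 above can be sketched as a wrapper around an arbitrary model. This is a minimal sketch under stated assumptions, not the released BAMI code: the `Grounder` interface, the helper names, the stride-based down-sampling, and the toy brightness model are hypothetical. In particular, the re-ranking step here simply reuses the model's score on the high-resolution crop, which stands in for the text-aware similarity score the post describes.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]                     # toy grayscale screenshot
Candidate = Tuple[Tuple[int, int], float]   # ((x, y), score)
Grounder = Callable[[Image, str], List[Candidate]]  # hypothetical interface


def downsample(img: Image, factor: int) -> Image:
    """Keep every `factor`-th pixel in both directions (crude coarse view)."""
    return [row[::factor] for row in img[::factor]]


def crop_around(img: Image, cx: int, cy: int, half: int):
    """Crop a window clamped to the image; also return its top-left offset."""
    h, w = len(img), len(img[0])
    x0, y0 = max(0, cx - half), max(0, cy - half)
    x1, y1 = min(w, cx + half), min(h, cy + half)
    return [row[x0:x1] for row in img[y0:y1]], (x0, y0)


def coarse_to_fine(model: Grounder, img: Image, instruction: str,
                   factor: int = 4, half: int = 8, top_k: int = 3):
    # Coarse pass: shortlist the top_k candidates on the down-sampled image.
    coarse = sorted(model(downsample(img, factor), instruction),
                    key=lambda c: c[1], reverse=True)[:top_k]
    refined = []
    for (cx, cy), _ in coarse:
        # Map back to full-resolution coordinates and re-run on a crop.
        patch, (x0, y0) = crop_around(img, cx * factor, cy * factor, half)
        (px, py), score = max(model(patch, instruction), key=lambda c: c[1])
        refined.append(((px + x0, py + y0), score))
    # Re-rank candidates. ASSUMPTION: the fine-pass score replaces the
    # text-aware similarity score; no gradients or weight updates are involved.
    return max(refined, key=lambda c: c[1])[0]


def toy_model(img: Image, instruction: str) -> List[Candidate]:
    """Stand-in model (assumption): scores non-black pixels by brightness."""
    cands = [((x, y), v) for y, row in enumerate(img)
             for x, v in enumerate(row) if v > 0]
    return sorted(cands, key=lambda c: c[1], reverse=True)[:5]


def make_toy_screen() -> Image:
    """64x64 screen: an 8x8 target with a bright peak and a smaller look-alike."""
    img = [[0] * 64 for _ in range(64)]
    for y in range(16, 24):
        for x in range(40, 48):
            img[y][x] = 200
    img[21][45] = 255            # fine detail lost by 4x down-sampling
    for y in range(48, 52):
        for x in range(8, 12):
            img[y][x] = 210      # look-alike that wins at coarse resolution
    return img
```

On the toy screen, the coarse pass alone prefers the look-alike (coarse point (2, 12), i.e. (8, 48) at full resolution), while `coarse_to_fine(toy_model, make_toy_screen(), "click the target")` returns (45, 21), the target's peak. A real model's scores may not be comparable across crops, so the final re-ranking rule is the part most likely to differ in practice.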

Results & Findings

Model (baseline)    | Accuracy on ScreenSpot‑Pro | Accuracy with BAMI | Δ
TianXi‑Action‑7B    | 51.9 %                     | 57.8 %             | +5.9 %
Other SOTA models   | 48–53 %                    | 53–58 %            | +4–6 %

  • Consistent gains across all tested models, suggesting that the two biases are not specific to a single model.
  • Ablation studies show that removing either the coarse‑to‑fine focus or the candidate selection drops performance back to near‑baseline, indicating that both components are needed for the full gain.
  • Parameter stability: varying the down‑sampling factor (2×–8×) or the candidate list size (3–7) changes results by at most 0.5 %, indicating that BAMI works with minimal tuning.
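The Δ values reported above are differences between two accuracy percentages, which is not the same as a relative improvement. The short sketch below makes the distinction explicit for the TianXi‑Action‑7B row; the function name `gains` is illustrative only.

```python
def gains(baseline: float, with_bami: float):
    """Return (absolute gain in percentage points, relative gain in %)."""
    absolute = round(with_bami - baseline, 1)
    relative = round(100 * (with_bami - baseline) / baseline, 1)
    return absolute, relative


if __name__ == "__main__":
    print(gains(51.9, 57.8))  # (5.9, 11.4)
```

So the reported +5.9 corresponds to roughly an 11.4 % relative increase over the 51.9 % baseline.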

Practical Implications

  • Faster deployment: Teams can improve existing GUI‑automation agents (e.g., test‑automation bots, accessibility tools) without costly retraining cycles.
  • Higher reliability in production: Reducing precision bias means fewer missed clicks on high‑DPI screens; mitigating ambiguity bias cuts down on wrong‑element selections in dense dashboards.
  • Plug‑and‑play for heterogeneous UIs: Because BAMI operates purely at inference, it can be added to pipelines that already support multiple device form‑factors (mobile, desktop, web).
  • Cost‑effective scaling: Organizations can roll out upgraded agents across thousands of machines by simply updating the inference wrapper, avoiding GPU‑intensive fine‑tuning.
  • Open‑source integration: The provided GitHub repo includes ready‑made wrappers for popular frameworks (PyTorch, TensorFlow), making it straightforward to embed BAMI into CI/CD testing suites or RPA platforms.

Limitations & Future Work

  • Dependence on visual quality: Extremely low‑resolution screenshots may still hinder the coarse‑to‑fine step, as the initial region‑proposal becomes noisy.
  • Limited to static GUIs: The current design assumes a single static frame; extending BAMI to video‑based interactions (e.g., drag‑and‑drop animations) remains an open challenge.
  • Candidate selection heuristics: While effective, the similarity scoring relies on textual metadata; GUIs lacking accessible labels may see reduced gains.
  • Future directions include integrating lightweight OCR to enrich textual cues, exploring adaptive down‑sampling strategies based on UI complexity, and applying BAMI to multimodal agents that combine speech commands with visual grounding.

Authors

  • Borui Zhang
  • Bo Zhang
  • Bo Wang
  • Wenzhao Zheng
  • Yuhao Cheng
  • Liang Tang
  • Yiqiang Yan
  • Jie Zhou
  • Jiwen Lu

Paper Information

  • arXiv ID: 2605.06664v1
  • Categories: cs.CV, cs.AI
  • Published: May 7, 2026