[Paper] GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Source: arXiv - 2602.22190v1
Overview
GUI‑Libra tackles a persistent gap between open‑source and proprietary GUI‑automation agents, especially on long‑horizon tasks such as multi‑step web or mobile workflows. By redesigning the data pipeline and the fine‑tuning / reinforcement‑learning stages, the authors show that native agents can achieve dramatically higher success rates without the need for massive online interaction data.
Key Contributions
- Curated reasoning dataset: 81 K high‑quality “reason‑then‑act” examples for web and mobile GUIs, built with a systematic construction‑and‑filtering pipeline.
- Action‑aware supervised fine‑tuning (SFT): A mixed‑data strategy that blends pure reasoning traces with direct‑action examples, plus token‑level re‑weighting that forces the model to focus on grounding actions.
- Stabilized RL under partial verifiability: Introduction of a KL‑regularized trust region for the RL‑with‑verification‑reward (RLVR) loop, plus a success‑adaptive gradient scaling that down‑weights noisy negative updates when the environment is ambiguous.
- Empirical validation: Consistent gains on several public web‑automation (e.g., MiniWoB) and mobile‑automation benchmarks, improving both step‑wise accuracy and end‑to‑end task completion.
- Open resources: Release of the 81 K dataset, training code, and pretrained models to the community.
Methodology
1. Data Construction & Filtering
- Harvested raw interaction logs from existing GUI agents and human demonstrations.
- Applied heuristic filters (action‑token consistency, language fluency, duplicate removal) to keep only traces where the natural‑language reasoning aligns tightly with the subsequent UI action.
- Result: a clean, diverse corpus covering a wide range of UI elements (buttons, dropdowns, gestures, etc.).
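The filtering stage can be sketched as a pair of simple checks; the field names and the substring-matching heuristic below are hypothetical illustrations, since the paper's bullet points don't specify the exact filter implementations.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    reasoning: str   # natural-language rationale preceding the action
    action: str      # executed UI command, e.g. 'click(id="submit")'
    target: str      # UI element the action operates on

def action_token_consistent(trace: Trace) -> bool:
    """Keep a trace only if the reasoning actually mentions the acted-on element
    (a crude stand-in for the paper's action-token consistency filter)."""
    return trace.target.lower() in trace.reasoning.lower()

def deduplicate(traces: list[Trace]) -> list[Trace]:
    """Drop exact (reasoning, action) duplicates, keeping the first occurrence."""
    seen, kept = set(), []
    for t in traces:
        key = (t.reasoning, t.action)
        if key not in seen:
            seen.add(key)
            kept.append(t)
    return kept

traces = [
    Trace("The Submit button confirms the form, so I click it.", 'click(id="submit")', "submit"),
    Trace("The Submit button confirms the form, so I click it.", 'click(id="submit")', "submit"),
    Trace("I should scroll down first.", 'click(id="login")', "login"),
]
clean = [t for t in deduplicate(traces) if action_token_consistent(t)]
print(len(clean))  # 1: the duplicate and the inconsistent trace are both dropped
```

A real pipeline would add the language-fluency filter (e.g., a perplexity threshold) on `reasoning`, which is omitted here.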
2. Action‑Aware Supervised Fine‑Tuning
- Instead of pure chain‑of‑thought (CoT) prompts, the training mix includes:
  - Reason‑then‑action examples (text reasoning followed by the exact UI command).
  - Direct‑action examples (no reasoning, just the correct UI command).
- Token‑level loss re‑weighting amplifies gradients on action tokens and UI identifiers, encouraging the model to stay grounded while still reasoning.
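The token‑level re‑weighting can be sketched as a weighted cross‑entropy in which tokens flagged as actions or UI identifiers count more than ordinary reasoning tokens. The weight value, tensor shapes, and toy batch below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def action_weighted_ce(logits, targets, action_mask, action_weight=2.0):
    """Cross-entropy where action/UI-identifier tokens (mask == 1) count
    `action_weight` times as much as ordinary reasoning tokens.
    Shapes: logits (T, V), targets (T,), action_mask (T,) in {0, 1}."""
    shifted = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_token = -log_probs[np.arange(len(targets)), targets]  # per-token CE
    weights = 1.0 + (action_weight - 1.0) * np.asarray(action_mask, dtype=float)
    return (weights * per_token).sum() / weights.sum()

# Toy batch over a 2-token vocabulary: token 2 is a mis-grounded action token
# (high loss, mask == 1), so re-weighting pulls the average loss toward it.
logits = np.array([[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]])
targets = np.array([0, 1, 1])
mask = np.array([0, 0, 1])
plain = action_weighted_ce(logits, targets, mask, action_weight=1.0)  # unweighted mean
weighted = action_weighted_ce(logits, targets, mask)                  # default weight 2.0
```

With `action_weight=1.0` the loss reduces to the ordinary mean cross‑entropy, so the re‑weighting is a strict generalization of standard SFT.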
3. Reinforcement Learning with Partial Verifiability (RLVR)
- Traditional step‑wise RL treats the single demonstrated action as the only “correct” one, even though several alternative actions may be equally valid. This partial verifiability means offline metrics penalize legitimate behavior and correlate poorly with online success.
- GUI‑Libra adds a KL‑regularization term that penalizes the policy for drifting too far from the SFT baseline, effectively forming a trust region.
- A success‑adaptive scaling factor monitors online episode outcomes; when the agent succeeds, negative gradients from mismatched actions are attenuated, preventing over‑penalization of alternative valid moves.
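A minimal sketch of how the two stabilizers might combine in a per‑token loss. The coefficient values (`kl_coef`, `neg_scale`), the sample‑based KL estimate, and the function signature are assumptions for illustration, not the paper's exact objective.

```python
import numpy as np

def rlvr_token_loss(logp_policy, logp_ref, advantage, episode_success,
                    kl_coef=0.1, neg_scale=0.3):
    """Sketch of a KL-regularized RLVR loss (hyperparameters illustrative):
      - policy-gradient term weighted by the (partially verifiable) advantage;
      - KL penalty toward the frozen SFT reference policy (the trust region);
      - success-adaptive scaling: when the episode succeeded, negative
        advantages (mismatches with the single demonstrated action) are
        attenuated by `neg_scale`, so alternative valid moves are not
        over-penalized."""
    adv = np.asarray(advantage, dtype=float)
    if episode_success:
        adv = np.where(adv < 0, neg_scale * adv, adv)       # attenuate noisy negatives
    pg_loss = -(adv * np.asarray(logp_policy)).mean()       # REINFORCE-style term
    kl_penalty = kl_coef * (np.asarray(logp_policy) - np.asarray(logp_ref)).mean()
    return pg_loss + kl_penalty

# Two tokens: one matches the demonstration (adv +1), one does not (adv -1).
# Because the episode succeeded, the mismatch is only mildly penalized.
loss = rlvr_token_loss(
    logp_policy=np.array([-1.0, -2.0]),
    logp_ref=np.array([-1.0, -1.5]),
    advantage=[1.0, -1.0],
    episode_success=True,
)
```

Without the success‑adaptive scaling (`episode_success=False`), the mismatched token's full negative advantage would dominate the update even though the episode as a whole succeeded.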
4. Training Pipeline
- Stage 1: Action‑aware SFT on the curated 81 K dataset.
- Stage 2: KL‑regularized RLVR on a small set of offline trajectories, followed by a brief online fine‑tune (optional) to polish performance.
Results & Findings
| Benchmark / Metric | Baseline (SFT‑only) | GUI‑Libra (SFT + RLVR) | Improvement |
|---|---|---|---|
| MiniWoB (web) | 48 % | 66 % | +18 pp |
| Mobile‑Env (Android) | 42 % | 61 % | +19 pp |
| Step‑wise Accuracy (average) | 71 % | 84 % | +13 pp |
- Offline metrics become predictive: The KL‑regularized RLVR correlates strongly (ρ ≈ 0.78) with online success, fixing the “partial verifiability” disconnect observed in prior work.
- Ablation studies show that removing either the action‑aware token re‑weighting or the KL trust region drops performance by ~7‑9 pp, confirming each component’s necessity.
- Data efficiency: With only ~10 K additional fine‑tuning steps, the model matches or exceeds closed‑source baselines that required millions of online interactions.
Practical Implications
- Faster prototyping of UI bots: Developers can now fine‑tune a pre‑trained language model on the released 81 K dataset and obtain a competent GUI agent in a few hours, rather than weeks of costly data collection.
- More reliable automation scripts: Action‑aware SFT reduces “hallucinated clicks” where the model reasons correctly but issues an out‑of‑scope UI command, a common pain point in current open‑source agents.
- Safer RL deployment: The KL trust region acts as a built‑in safeguard, preventing the policy from taking wildly exploratory (and potentially destructive) actions during online learning—critical for production environments that cannot afford UI crashes.
- Cross‑platform applicability: Because the dataset spans both web and mobile interactions, the same fine‑tuning pipeline can be reused for desktop, web, or mobile automation tools, lowering the barrier for multi‑platform bots.
Limitations & Future Work
- Partial verifiability still relies on a single demonstrated action; while KL regularization mitigates the issue, truly multi‑modal verification (e.g., using UI state equivalence classes) remains unexplored.
- Dataset bias: The curated 81 K examples are drawn from a limited set of popular apps and websites; performance may degrade on niche or highly dynamic UIs.
- Scalability of RLVR: The current RL loop is offline‑heavy; extending it to large‑scale, on‑device learning (e.g., edge mobile agents) will require more efficient credit‑assignment methods.
- User intent handling: The work assumes well‑specified natural‑language goals; integrating ambiguous or multi‑intent queries is an open research direction.
GUI‑Libra demonstrates that thoughtful data curation and training recipes can bridge the performance gap for open‑source GUI agents, offering a practical roadmap for developers eager to build reliable, reasoning‑capable automation tools.
Authors
- Rui Yang
- Qianhui Wu
- Zhaoyang Wang
- Hanyang Chen
- Ke Yang
- Hao Cheng
- Huaxiu Yao
- Baolin Peng
- Huan Zhang
- Jianfeng Gao
- Tong Zhang
Paper Information
- arXiv ID: 2602.22190v1
- Categories: cs.LG, cs.AI, cs.CL
- Published: February 25, 2026