[Paper] GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Source: arXiv - 2602.22190v1
Overview
GUI‑Libra tackles a persistent gap between open‑source and proprietary GUI‑automation agents, especially on long‑horizon tasks such as multi‑step web or mobile workflows. By redesigning the data pipeline and the fine‑tuning / reinforcement‑learning stages, the authors show that native agents can achieve dramatically higher success rates without the need for massive online interaction data.
Key Contributions
- Curated reasoning dataset: 81 K high‑quality “reason‑then‑act” examples for web and mobile GUIs, built with a systematic construction‑and‑filtering pipeline.
- Action‑aware supervised fine‑tuning (SFT): A mixed‑data strategy that blends pure reasoning traces with direct‑action examples, plus token‑level re‑weighting that forces the model to focus on grounding actions.
- Stabilized RL under partial verifiability: Introduction of a KL‑regularized trust region for the RL‑with‑verification‑reward (RLVR) loop, plus a success‑adaptive gradient scaling that down‑weights noisy negative updates when the environment is ambiguous.
- Empirical validation: Consistent gains on several public web‑automation (e.g., MiniWoB) and mobile‑automation benchmarks, improving both step‑wise accuracy and end‑to‑end task completion.
- Open resources: Release of the 81 K dataset, training code, and pretrained models to the community.
Methodology
1. Data Construction & Filtering
- Harvested raw interaction logs from existing GUI agents and human demonstrations.
- Applied heuristic filters (action‑token consistency, language fluency, duplicate removal) to keep only traces where the natural‑language reasoning aligns tightly with the subsequent UI action.
- Result: a clean, diverse corpus covering a wide range of UI elements (buttons, dropdowns, gestures, etc.).
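The filtering stage can be sketched as a pair of simple checks; the field names and the substring-matching heuristic below are hypothetical illustrations, since the paper's bullet points don't specify the exact filter implementations.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    reasoning: str   # natural-language rationale preceding the action
    action: str      # executed UI command, e.g. 'click(id="submit")'
    target: str      # UI element the action operates on

def action_token_consistent(trace: Trace) -> bool:
    """Keep a trace only if the reasoning actually mentions the acted-on element
    (a crude stand-in for the paper's action-token consistency filter)."""
    return trace.target.lower() in trace.reasoning.lower()

def deduplicate(traces: list[Trace]) -> list[Trace]:
    """Drop exact (reasoning, action) duplicates, keeping the first occurrence."""
    seen, kept = set(), []
    for t in traces:
        key = (t.reasoning, t.action)
        if key not in seen:
            seen.add(key)
            kept.append(t)
    return kept

traces = [
    Trace("The Submit button confirms the form, so I click it.", 'click(id="submit")', "submit"),
    Trace("The Submit button confirms the form, so I click it.", 'click(id="submit")', "submit"),
    Trace("I should scroll down first.", 'click(id="login")', "login"),
]
clean = [t for t in deduplicate(traces) if action_token_consistent(t)]
print(len(clean))  # 1: the duplicate and the inconsistent trace are both dropped
```

A real pipeline would add the language-fluency filter (e.g., a perplexity threshold) on `reasoning`, which is omitted here.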
2. Action‑Aware Supervised Fine‑Tuning
- Instead of pure chain‑of‑thought (CoT) prompts, the training mix includes:
  - Reason‑then‑action examples (text reasoning followed by the exact UI command).
  - Direct‑action examples (no reasoning, just the correct UI command).
- Token‑level loss re‑weighting amplifies gradients on action tokens and UI identifiers, encouraging the model to stay grounded while still reasoning.
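The token‑level re‑weighting can be sketched as a weighted cross‑entropy in which tokens flagged as actions or UI identifiers count more than ordinary reasoning tokens. The weight value, tensor shapes, and toy batch below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def action_weighted_ce(logits, targets, action_mask, action_weight=2.0):
    """Cross-entropy where action/UI-identifier tokens (mask == 1) count
    `action_weight` times as much as ordinary reasoning tokens.
    Shapes: logits (T, V), targets (T,), action_mask (T,) in {0, 1}."""
    shifted = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_token = -log_probs[np.arange(len(targets)), targets]  # per-token CE
    weights = 1.0 + (action_weight - 1.0) * np.asarray(action_mask, dtype=float)
    return (weights * per_token).sum() / weights.sum()

# Toy batch over a 2-token vocabulary: token 2 is a mis-grounded action token
# (high loss, mask == 1), so re-weighting pulls the average loss toward it.
logits = np.array([[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]])
targets = np.array([0, 1, 1])
mask = np.array([0, 0, 1])
plain = action_weighted_ce(logits, targets, mask, action_weight=1.0)  # unweighted mean
weighted = action_weighted_ce(logits, targets, mask)                  # default weight 2.0
```

With `action_weight=1.0` the loss reduces to the ordinary mean cross‑entropy, so the re‑weighting is a strict generalization of standard SFT.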
3. Reinforcement Learning with Partial Verifiability (RLVR)
- Traditional step‑wise RL treats the single demonstrated action as the only “correct” one, even though several alternative actions may be equally valid. This partial verifiability means offline metrics penalize legitimate behavior and correlate poorly with online success.
- GUI‑Libra adds a KL‑regularization term that penalizes the policy for drifting too far from the SFT baseline, effectively forming a trust region.
- A success‑adaptive scaling factor monitors online episode outcomes; when the agent succeeds, negative gradients from mismatched actions are attenuated, preventing over‑penalization of alternative valid moves.
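A minimal sketch of how the two stabilizers might combine in a per‑token loss. The coefficient values (`kl_coef`, `neg_scale`), the sample‑based KL estimate, and the function signature are assumptions for illustration, not the paper's exact objective.

```python
import numpy as np

def rlvr_token_loss(logp_policy, logp_ref, advantage, episode_success,
                    kl_coef=0.1, neg_scale=0.3):
    """Sketch of a KL-regularized RLVR loss (hyperparameters illustrative):
      - policy-gradient term weighted by the (partially verifiable) advantage;
      - KL penalty toward the frozen SFT reference policy (the trust region);
      - success-adaptive scaling: when the episode succeeded, negative
        advantages (mismatches with the single demonstrated action) are
        attenuated by `neg_scale`, so alternative valid moves are not
        over-penalized."""
    adv = np.asarray(advantage, dtype=float)
    if episode_success:
        adv = np.where(adv < 0, neg_scale * adv, adv)       # attenuate noisy negatives
    pg_loss = -(adv * np.asarray(logp_policy)).mean()       # REINFORCE-style term
    kl_penalty = kl_coef * (np.asarray(logp_policy) - np.asarray(logp_ref)).mean()
    return pg_loss + kl_penalty

# Two tokens: one matches the demonstration (adv +1), one does not (adv -1).
# Because the episode succeeded, the mismatch is only mildly penalized.
loss = rlvr_token_loss(
    logp_policy=np.array([-1.0, -2.0]),
    logp_ref=np.array([-1.0, -1.5]),
    advantage=[1.0, -1.0],
    episode_success=True,
)
```

Without the success‑adaptive scaling (`episode_success=False`), the mismatched token's full negative advantage would dominate the update even though the episode as a whole succeeded.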
4. Training Pipeline
- Stage 1: Action‑aware SFT on the curated 81 K dataset.
- Stage 2: KL‑regularized RLVR on a small set of offline trajectories, followed by a brief online fine‑tune (optional) to polish performance.
Results & Findings
| Benchmark / Metric | Baseline (SFT‑only) | GUI‑Libra (SFT + RLVR) | Improvement |
|---|---|---|---|
| MiniWoB (web) | 48 % | 66 % | +18 pp |
| Mobile‑Env (Android) | 42 % | 61 % | +19 pp |
| Step‑wise Accuracy (average) | 71 % | 84 % | +13 pp |
- Offline metrics become predictive: The KL‑regularized RLVR correlates strongly (ρ ≈ 0.78) with online success, fixing the “partial verifiability” disconnect observed in prior work.
- Ablation studies show that removing either the action‑aware token re‑weighting or the KL trust region drops performance by ~7‑9 pp, confirming each component’s necessity.
- Data efficiency: With only ~10 K additional fine‑tuning steps, the model matches or exceeds closed‑source baselines that required millions of online interactions.
Practical Implications
- Faster prototyping of UI bots: Developers can now fine‑tune a pre‑trained language model on the released 81 K dataset and obtain a competent GUI agent in a few hours, rather than weeks of costly data collection.
- More reliable automation scripts: Action‑aware SFT reduces “hallucinated clicks” where the model reasons correctly but issues an out‑of‑scope UI command, a common pain point in current open‑source agents.
- Safer RL deployment: The KL trust region acts as a built‑in safeguard, preventing the policy from taking wildly exploratory (and potentially destructive) actions during online learning—critical for production environments that cannot afford UI crashes.
- Cross‑platform applicability: Because the dataset spans both web and mobile interactions, the same fine‑tuning pipeline can be reused for desktop, web, or mobile automation tools, lowering the barrier for multi‑platform bots.
Limitations & Future Work
- Partial verifiability still relies on a single demonstrated action; while KL regularization mitigates the issue, truly multi‑modal verification (e.g., using UI state equivalence classes) remains unexplored.
- Dataset bias: The curated 81 K examples are drawn from a limited set of popular apps and websites; performance may degrade on niche or highly dynamic UIs.
- Scalability of RLVR: The current RL loop is offline‑heavy; extending it to large‑scale, on‑device learning (e.g., edge mobile agents) will require more efficient credit‑assignment methods.
- User intent handling: The work assumes well‑specified natural‑language goals; integrating ambiguous or multi‑intent queries is an open research direction.
GUI‑Libra demonstrates that thoughtful data curation and training recipes can bridge the performance gap for open‑source GUI agents, offering a practical roadmap for developers eager to build reliable, reasoning‑capable automation tools.
Authors
- Rui Yang
- Qianhui Wu
- Zhaoyang Wang
- Hanyang Chen
- Ke Yang
- Hao Cheng
- Huaxiu Yao
- Baolin Peng
- Huan Zhang
- Jianfeng Gao
- Tong Zhang
Paper Information
- arXiv ID: 2602.22190v1
- Categories: cs.LG, cs.AI, cs.CL
- Published: February 25, 2026