PA Bench: Evaluating Frontier Models on Multi-Tab PA Tasks

Published: February 25, 2026 at 03:11 PM EST

Source: Hacker News

We’re the team at Vibrant Labs (W24). We’ve been building environments for browser agents and quickly realized that existing benchmarks in this space didn’t capture the primary failure modes we were seeing in production, which grew more common as the number of applications and the horizon length increased.

We built PA Bench (Personal Assistant Benchmark) to evaluate frontier computer/web‑use models on their ability to handle multi‑step workflows across simulated clones of Gmail and Calendar.

What’s next

We’re currently scaling the dataset to 3+ tabs and building more high‑fidelity simulations for common enterprise workflows. We’d love to hear feedback on the benchmark and notes on what was or wasn’t surprising about the results.

