[Paper] Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

Published: June 9, 2026 at 10:59 AM EDT

Source: arXiv - 2606.10956v1

Overview

The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal environment for benchmarking document-automation capability, as it requires long-horizon planning and reasoning, precise parameter configuration, and multi-application integration. To quantify this capability, we introduce an evaluation based on China’s National Computer Rank Examination (NCRE), featuring 200 comprehensive practical-operation tasks across Word, Excel, and PowerPoint. Each task is scored on a 100-point rubric scale using 7,118 machine-gradable criteria, and Score Rate (SR) denotes the mean percentage of rubric points earned across these tasks. We benchmark 7 frontier LLMs and observe stark limitations: single-turn models score a maximum of 36.6%. A stronger agentic system with execution feedback, iterative repair, and broader Office automation access reaches 68.8%, but remains below the 95.5% community-reference score used as a scoring sanity check. Ultimately, our experiments demonstrate that despite recent advancements in code generation, achieving reliable fine-grained Office document automation remains a significant challenge for current code-generating LLM and agent systems.
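The Score Rate (SR) definition above can be illustrated with a minimal sketch. It assumes each criterion is all-or-nothing and carries a fixed point weight; the abstract does not say how points are distributed across criteria or whether partial credit exists, so the `Criterion` fields and both function names are hypothetical and this is not the authors' grader.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    points: float  # rubric points this check is worth (hypothetical field)
    passed: bool   # whether the machine-graded check succeeded (assumed all-or-nothing)


def task_score_pct(criteria):
    """Percentage of rubric points earned on a single task."""
    total = sum(c.points for c in criteria)
    if total == 0:
        raise ValueError("task has no rubric points")
    earned = sum(c.points for c in criteria if c.passed)
    return 100.0 * earned / total


def score_rate(tasks):
    """Mean per-task percentage across all tasks, i.e. SR as described in the overview."""
    if not tasks:
        raise ValueError("no tasks to score")
    return sum(task_score_pct(t) for t in tasks) / len(tasks)


# Two toy tasks, each with a 100-point rubric (made-up values, not paper data).
tasks = [
    [Criterion(60, True), Criterion(40, False)],  # earns 60 of 100 points
    [Criterion(50, True), Criterion(50, True)],   # earns 100 of 100 points
]
sr = score_rate(tasks)
```

Averaging per-task percentages, rather than pooling all points, gives every task equal weight. When every rubric totals 100 points, as the overview states, the two approaches give the same result.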

Key Contributions

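Based on the abstract, the paper's main contributions are:

An evaluation based on China's National Computer Rank Examination (NCRE), with 200 practical-operation tasks across Word, Excel, and PowerPoint
A 100-point rubric per task, built from 7,118 machine-gradable criteria and summarized as Score Rate (SR)
A benchmark of 7 frontier LLMs, in which single-turn models score at most 36.6% and a stronger agentic system reaches 68.8%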

Methodology

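Tasks are drawn from the NCRE and cover Word, Excel, and PowerPoint. Each task is scored on a 100-point rubric using machine-gradable criteria, and SR is the mean percentage of rubric points earned across tasks. The authors compare single-turn models with a stronger agentic system that has execution feedback, iterative repair, and broader Office automation access. A 95.5% community-reference score serves as a scoring sanity check. Please refer to the full paper for the detailed methodology.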

Practical Implications

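The results indicate that reliable, fine-grained Office document automation remains a significant challenge for current code-generating LLM and agent systems. Even the strongest agentic system evaluated (68.8%) stays well below the 95.5% community-reference score.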

Authors

Tengchao Lv
Dongdong Zhang
Jiayu Ding
Yilin Jia
Yuzhong Zhao
Yupan Huang
Wenshan Wu
Xiangyang Zhou
Shaohan Huang
Nan Yang
Li Dong
Lei Cui
Furu Wei

Paper Information

arXiv ID: 2606.10956v1
Categories: cs.AI, cs.CL
Published: June 9, 2026
