[Paper] ArkEval: Benchmarking and Evaluating Automated Code Repair for ArkTS
Source: arXiv - 2602.08866v1
Overview
The paper introduces ArkEval, the first dedicated benchmark for automated code repair in ArkTS—Huawei’s statically‑typed extension of TypeScript used in HarmonyOS apps. By assembling a realistic set of reproducible bugs and a novel test‑generation pipeline, the authors provide a solid yardstick for measuring how well large language models (LLMs) can fix real‑world mobile code.
Key Contributions
- First ArkTS repair benchmark: 502 reproducible bugs mined from >400 open‑source HarmonyOS applications.
- LLM‑driven test generation: A voting‑based approach that leverages Claude and other LLMs to synthesize reliable test cases for each bug.
- Standardized problem statements: Uniform description format that removes ambiguities and enables fair, head‑to‑head model comparison.
- Evaluation pipeline: Retrieval‑augmented repair workflow applied to four state‑of‑the‑art LLMs, exposing their strengths and blind spots on ArkTS.
- Open‑source release: Benchmark data, scripts, and evaluation harness are publicly available for the community.
Methodology
- Data collection – The authors crawled Huawei’s official ArkTS repository, extracting issue reports and associated pull requests.
- Filtering & reproducibility – A multi‑stage manual and automated filter removed non‑bug issues, duplicated reports, and anything that could not be rebuilt in a clean environment, leaving 502 bugs that compile and fail a test.
- Test generation – Because many issues lacked explicit test suites, the team prompted several LLMs (Claude, GPT‑4, etc.) to write unit tests that capture the failing behavior. A simple voting scheme kept only tests that multiple models agreed on, boosting reliability.
- Problem statement normalization – Each bug is expressed in a concise template (description, location, expected behavior) so that any repair system receives the same input format.
- Repair workflow – For evaluation, the authors used a retrieval‑augmented pipeline: relevant code snippets are first fetched from the repository, then fed to the LLM along with the standardized bug description. The model outputs a patch, which is applied and run against the generated tests.
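The voting step in the test‑generation stage can be sketched as below. The paper does not publish its exact algorithm, so the agreement threshold, the shape of the model callables, and the toy string‑based "judges" are all illustrative assumptions:

```python
from collections import Counter

def vote_filter(candidate_tests, models, agreement_threshold=2):
    """Keep only candidate tests that at least `agreement_threshold`
    models independently endorse (illustrative sketch, not the
    paper's exact procedure)."""
    votes = Counter()
    for model in models:
        # Hypothetical: each model returns the subset of candidate
        # tests it judges to correctly capture the failing behavior.
        for test in model(candidate_tests):
            votes[test] += 1
    return [t for t in candidate_tests if votes[t] >= agreement_threshold]

# Toy usage with stub "judges" standing in for LLM calls.
model_a = lambda tests: [t for t in tests if "fail" in t]
model_b = lambda tests: [t for t in tests if t.endswith("fail")]
model_c = lambda tests: list(tests)  # endorses everything

kept = vote_filter(["t1_fail", "t2_pass", "t3_fail"],
                   [model_a, model_b, model_c])
# kept == ["t1_fail", "t3_fail"]
```

The intuition is the same as ensemble voting elsewhere: a test that several independent models agree captures the bug is less likely to be a hallucinated or flaky test than one proposed by a single model.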
Results & Findings
| Model | Pass@1 (first attempt) | Pass@5 | Avg. patch size | Notable strength |
|---|---|---|---|---|
| Claude‑2 | 22% | 34% | 12 lines | Handles type‑related errors well |
| GPT‑4 | 18% | 29% | 14 lines | Good at API misuse fixes |
| LLaMA‑2‑70B | 12% | 21% | 16 lines | Performs decently on simple syntax bugs |
| CodeBERT‑based | 8% | 15% | 10 lines | Struggles with ArkTS‑specific constructs |
- Overall success: Even the best model solves only about one‑quarter of the bugs on the first try, indicating substantial room for improvement.
- Error patterns: Most failures stem from misunderstandings of ArkTS’s static typing rules and its unique UI component lifecycle, which are under‑represented in the models’ training data.
- Retrieval impact: Adding a retrieval step improves success rates by roughly 6–8 percentage points, confirming that context matters for low‑resource languages like ArkTS.
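The Pass@1 and Pass@5 columns above are instances of the standard pass@k metric. A common unbiased estimator for it comes from the code‑generation evaluation literature; whether this paper uses exactly this estimator is an assumption on my part:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of
    k patches, sampled without replacement from n generated patches of
    which c pass the tests, is correct."""
    if n - c < k:
        return 1.0  # too few failing patches to fill a k-sample with failures
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 candidate patches per bug, 2 of which pass the tests.
print(round(pass_at_k(10, 2, 1), 3))  # 0.2
print(round(pass_at_k(10, 2, 5), 3))  # 0.778
```

Computing it this way, rather than naively averaging over k literal draws, avoids the high variance of sampling exactly k patches per bug.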
Practical Implications
- Tooling for HarmonyOS developers: ArkEval can serve as a regression suite for IDE plugins that aim to suggest quick‑fixes or auto‑refactorings in ArkTS projects.
- LLM fine‑tuning: The benchmark offers a concrete, domain‑specific dataset for further pre‑training or instruction‑tuning of LLMs, potentially yielding models that understand ArkTS semantics out‑of‑the‑box.
- Continuous integration: Teams can integrate the test‑generation pipeline into CI pipelines to automatically surface failing tests for newly reported bugs, accelerating triage.
- Cross‑platform insights: Findings highlight that generic code‑repair models trained primarily on JavaScript/TypeScript struggle with ArkTS, suggesting that other low‑resource TypeScript‑adjacent ecosystems (e.g., Deno, Flow) may likewise need dedicated data.
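The CI‑integration idea in the third bullet can be sketched as a small triage hook. Both `generate_tests_for` (a stand‑in for the paper's LLM test‑generation pipeline) and `run_test` (a stand‑in for whatever ArkTS test harness a project uses) are hypothetical callables injected by the caller:

```python
def triage_new_bug(bug_report: str, generate_tests_for, run_test) -> bool:
    """Generate tests for a freshly filed bug report and run them.
    Returns True if the bug is reproduced, i.e. at least one generated
    test fails. Both callables are placeholders for real tooling."""
    for test_path in generate_tests_for(bug_report):
        # run_test returns True when the test passes; a failing test
        # means the reported bug is reproducible in this checkout.
        if not run_test(test_path):
            return True
    return False

# Toy usage: one of the two generated tests fails, so triage flags the bug.
gen = lambda report: ["tests/a.test.ets", "tests/b.test.ets"]
runner = lambda path: path.endswith("a.test.ets")  # b fails
reproduced = triage_new_bug("NPE in list scroll", gen, runner)
# reproduced == True
```

Wiring this into CI means every new issue arrives with a ready‑made failing test, which is exactly the reproducibility property ArkEval's own bug set was filtered for.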
Limitations & Future Work
- Test quality reliance on LLMs: Although the voting mechanism mitigates noise, generated tests may still miss edge cases, leading to over‑optimistic pass rates.
- Static benchmark size: 502 bugs is a solid start but still modest compared to JavaScript/Java benchmarks; expanding coverage across more ArkTS libraries and newer HarmonyOS APIs is needed.
- Model diversity: Only four LLMs were evaluated; future work should include open‑source models and explore multi‑modal inputs (e.g., UI screenshots) that are common in mobile development.
- Human‑in‑the‑loop repair: The current pipeline is fully automated; incorporating developer feedback could improve patch relevance and help train more usable assistants.
ArkEval opens the door for systematic, data‑driven improvement of automated repair tools in the rapidly growing HarmonyOS ecosystem. For developers eager to adopt AI‑assisted debugging, the benchmark offers both a challenge and a roadmap toward more reliable, ArkTS‑aware assistants.
Authors
- Bang Xie
- Senjian Zhang
- Zhiyuan Peng
- Wei Chen
- Chenhao Ying
- Yuan Luo
Paper Information
- arXiv ID: 2602.08866v1
- Categories: cs.SE
- Published: February 9, 2026