[Paper] ArkEval: Benchmarking and Evaluating Automated Code Repair for ArkTS
Source: arXiv - 2602.08866v1
Overview
The paper introduces ArkEval, the first dedicated benchmark for automated code repair in ArkTS—Huawei’s statically‑typed extension of TypeScript used in HarmonyOS apps. By assembling a realistic set of reproducible bugs and a novel test‑generation pipeline, the authors provide a solid yardstick for measuring how well large language models (LLMs) can fix real‑world mobile code.
Key Contributions
- First ArkTS repair benchmark: 502 reproducible bugs mined from >400 open‑source HarmonyOS applications.
- LLM‑driven test generation: A voting‑based approach that leverages Claude and other LLMs to synthesize reliable test cases for each bug.
- Standardized problem statements: Uniform description format that removes ambiguities and enables fair, head‑to‑head model comparison.
- Evaluation pipeline: Retrieval‑augmented repair workflow applied to four state‑of‑the‑art LLMs, exposing their strengths and blind spots on ArkTS.
- Open‑source release: Benchmark data, scripts, and evaluation harness are publicly available for the community.
Methodology
- Data collection – The authors crawled Huawei’s official ArkTS repository, extracting issue reports and associated pull requests.
- Filtering & reproducibility – A multi‑stage manual and automated filter removed non‑bug issues, duplicated reports, and anything that could not be rebuilt in a clean environment, leaving 502 bugs that compile and fail a test.
- Test generation – Because many issues lacked explicit test suites, the team prompted several LLMs (Claude, GPT‑4, etc.) to write unit tests that capture the failing behavior. A simple voting scheme kept only tests that multiple models agreed on, boosting reliability.
- Problem statement normalization – Each bug is expressed in a concise template (description, location, expected behavior) so that any repair system receives the same input format.
- Repair workflow – For evaluation, the authors used a retrieval‑augmented pipeline: relevant code snippets are first fetched from the repository, then fed to the LLM along with the standardized bug description. The model outputs a patch, which is applied and run against the generated tests.
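The voting step in the test‑generation stage can be sketched as below. The paper does not publish its exact algorithm, so the agreement threshold, the shape of the model callables, and the toy string‑based "judges" are all illustrative assumptions:

```python
from collections import Counter

def vote_filter(candidate_tests, models, agreement_threshold=2):
    """Keep only candidate tests that at least `agreement_threshold`
    models independently endorse (illustrative sketch, not the
    paper's exact procedure)."""
    votes = Counter()
    for model in models:
        # Hypothetical: each model returns the subset of candidate
        # tests it judges to correctly capture the failing behavior.
        for test in model(candidate_tests):
            votes[test] += 1
    return [t for t in candidate_tests if votes[t] >= agreement_threshold]

# Toy usage with stub "judges" standing in for LLM calls.
model_a = lambda tests: [t for t in tests if "fail" in t]
model_b = lambda tests: [t for t in tests if t.endswith("fail")]
model_c = lambda tests: list(tests)  # endorses everything

kept = vote_filter(["t1_fail", "t2_pass", "t3_fail"],
                   [model_a, model_b, model_c])
# kept == ["t1_fail", "t3_fail"]
```

The intuition is the same as ensemble voting elsewhere: a test that several independent models agree captures the bug is less likely to be a hallucinated or flaky test than one proposed by a single model.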
Results & Findings
| Model | Pass@1 (first attempt) | Pass@5 | Avg. patch size | Notable strength |
|---|---|---|---|---|
| Claude‑2 | 22% | 34% | 12 lines | Handles type‑related errors well |
| GPT‑4 | 18% | 29% | 14 lines | Good at API misuse fixes |
| LLaMA‑2‑70B | 12% | 21% | 16 lines | Performs decently on simple syntax bugs |
| CodeBERT‑based | 8% | 15% | 10 lines | Struggles with ArkTS‑specific constructs |
- Overall success: Even the best model solves only about one‑quarter of the bugs on the first try, indicating substantial room for improvement.
- Error patterns: Most failures stem from misunderstandings of ArkTS’s static typing rules and its unique UI component lifecycle, which are under‑represented in the models’ training data.
- Retrieval impact: Adding a retrieval step improves success rates by roughly 6–8 percentage points, confirming that context matters for low‑resource languages like ArkTS.
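The Pass@1 and Pass@5 columns above are instances of the standard pass@k metric. A common unbiased estimator for it comes from the code‑generation evaluation literature; whether this paper uses exactly this estimator is an assumption on my part:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of
    k patches, sampled without replacement from n generated patches of
    which c pass the tests, is correct."""
    if n - c < k:
        return 1.0  # too few failing patches to fill a k-sample with failures
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 candidate patches per bug, 2 of which pass the tests.
print(round(pass_at_k(10, 2, 1), 3))  # 0.2
print(round(pass_at_k(10, 2, 5), 3))  # 0.778
```

Computing it this way, rather than naively averaging over k literal draws, avoids the high variance of sampling exactly k patches per bug.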
Practical Implications
- Tooling for HarmonyOS developers: ArkEval can serve as a regression suite for IDE plugins that aim to suggest quick‑fixes or auto‑refactorings in ArkTS projects.
- LLM fine‑tuning: The benchmark offers a concrete, domain‑specific dataset for further pre‑training or instruction‑tuning of LLMs, potentially yielding models that understand ArkTS semantics out‑of‑the‑box.
- Continuous integration: Teams can integrate the test‑generation pipeline into CI pipelines to automatically surface failing tests for newly reported bugs, accelerating triage.
- Cross‑platform insights: Findings highlight that generic code‑repair models trained primarily on JavaScript/TypeScript struggle with ArkTS, suggesting that other low‑resource TypeScript‑adjacent ecosystems (e.g., Deno, Flow) may likewise need dedicated data.
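The CI‑integration idea in the third bullet can be sketched as a small triage hook. Both `generate_tests_for` (a stand‑in for the paper's LLM test‑generation pipeline) and `run_test` (a stand‑in for whatever ArkTS test harness a project uses) are hypothetical callables injected by the caller:

```python
def triage_new_bug(bug_report: str, generate_tests_for, run_test) -> bool:
    """Generate tests for a freshly filed bug report and run them.
    Returns True if the bug is reproduced, i.e. at least one generated
    test fails. Both callables are placeholders for real tooling."""
    for test_path in generate_tests_for(bug_report):
        # run_test returns True when the test passes; a failing test
        # means the reported bug is reproducible in this checkout.
        if not run_test(test_path):
            return True
    return False

# Toy usage: one of the two generated tests fails, so triage flags the bug.
gen = lambda report: ["tests/a.test.ets", "tests/b.test.ets"]
runner = lambda path: path.endswith("a.test.ets")  # b fails
reproduced = triage_new_bug("NPE in list scroll", gen, runner)
# reproduced == True
```

Wiring this into CI means every new issue arrives with a ready‑made failing test, which is exactly the reproducibility property ArkEval's own bug set was filtered for.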
Limitations & Future Work
- Test quality reliance on LLMs: Although the voting mechanism mitigates noise, generated tests may still miss edge cases, leading to over‑optimistic pass rates.
- Static benchmark size: 502 bugs is a solid start but still modest compared to JavaScript/Java benchmarks; expanding coverage across more ArkTS libraries and newer HarmonyOS APIs is needed.
- Model diversity: Only four LLMs were evaluated; future work should include open‑source models and explore multi‑modal inputs (e.g., UI screenshots) that are common in mobile development.
- Human‑in‑the‑loop repair: The current pipeline is fully automated; incorporating developer feedback could improve patch relevance and help train more usable assistants.
ArkEval opens the door for systematic, data‑driven improvement of automated repair tools in the rapidly growing HarmonyOS ecosystem. For developers eager to adopt AI‑assisted debugging, the benchmark offers both a challenge and a roadmap toward more reliable, ArkTS‑aware assistants.
Authors
- Bang Xie
- Senjian Zhang
- Zhiyuan Peng
- Wei Chen
- Chenhao Ying
- Yuan Luo
Paper Information
- arXiv ID: 2602.08866v1
- Categories: cs.SE
- Published: February 9, 2026