[Paper] ArkTS-CodeSearch: An Open-Source ArkTS Dataset for Code Retrieval

Published: February 5, 2026 at 06:15 AM EST
4 min read
Source: arXiv - 2602.05550v1

Overview

ArkTS is the primary language powering apps in the OpenHarmony ecosystem, but developers and researchers have struggled to build intelligent tools for it because no public code‑search datasets existed. This paper introduces ArkTS‑CodeSearch, the first large‑scale, open‑source collection of ArkTS functions paired with natural‑language comments, together with a benchmark that measures how well models can retrieve the right function from a textual query.

Key Contributions

  • First public ArkTS dataset: over 200,000 comment‑function pairs harvested from GitHub and Gitee, cleaned with a custom tree‑sitter‑arkts parser.
  • Systematic benchmark (single‑search task): Given a natural‑language comment, models must rank the correct ArkTS function among thousands of candidates.
  • Comprehensive evaluation of existing code‑embedding models on the new benchmark, exposing their strengths and gaps for ArkTS.
  • Fine‑tuned embedding model that combines ArkTS and TypeScript training data, achieving state‑of‑the‑art retrieval performance on ArkTS code.
  • Open‑source release of the dataset and the fine‑tuned model on Hugging Face, enabling reproducibility and downstream tool building.

Methodology

  1. Data collection – The authors crawled public ArkTS repositories from both GitHub and Gitee. Using the tree‑sitter‑arkts grammar, they extracted every function definition together with its preceding doc‑comment (the natural‑language description).
  2. Deduplication & cleaning – Cross‑platform duplicate detection removed identical functions that appeared in multiple forks or mirrors. Functions were then categorized (e.g., UI, system API, utility) to understand the corpus composition.
  3. Benchmark design – The single‑search task presents a comment and asks a model to retrieve the matching function from a large pool. Retrieval quality is measured with standard IR metrics such as Recall@k and MRR.
  4. Model evaluation & fine‑tuning – Off‑the‑shelf code‑embedding models (e.g., CodeBERT, GraphCodeBERT, StarCoder) were evaluated directly on the benchmark. Afterwards, the authors fine‑tuned a base embedding model on a mixed ArkTS + TypeScript training set, optimizing a contrastive loss that pushes matching comment‑function pairs together in the embedding space.
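
The retrieval metrics from step 3 are simple to compute once each query's gold function has a rank in the returned candidate list. A minimal sketch (the `ranks` values below are hypothetical, not results from the paper):

```python
def recall_at_k(ranks, k):
    """Fraction of queries whose gold function appears in the top-k results.

    `ranks` holds the 1-indexed rank of the correct function per query."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average reciprocal rank (1/rank) of the gold function across queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


ranks = [1, 3, 2, 10, 1]           # hypothetical gold ranks for five queries
print(recall_at_k(ranks, 1))       # 0.4
print(recall_at_k(ranks, 5))       # 0.8
print(round(mean_reciprocal_rank(ranks), 3))  # 0.587
```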

The pipeline is deliberately kept simple so that other researchers can replicate it for new languages or extend the dataset with additional repositories.
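
As a concrete illustration of the contrastive objective in step 4, here is a minimal NumPy sketch of a symmetric InfoNCE-style loss over a batch of matching comment/function embeddings. The paper does not publish its exact loss implementation; the temperature value and the symmetric formulation here are assumptions.

```python
import numpy as np

def info_nce_loss(comment_emb, code_emb, temperature=0.05):
    """Symmetric InfoNCE-style loss: row i of each matrix is a matching
    comment/function pair; the other rows in the batch act as negatives."""
    c = comment_emb / np.linalg.norm(comment_emb, axis=1, keepdims=True)
    f = code_emb / np.linalg.norm(code_emb, axis=1, keepdims=True)
    logits = c @ f.T / temperature             # pairwise cosine similarities
    idx = np.arange(len(logits))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)   # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()     # diagonal = matching pairs

    # Pull in both directions: comment -> function and function -> comment.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Minimizing this loss pushes each comment embedding toward its own function and away from the other functions in the batch, which is exactly the geometry the single-search benchmark rewards.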

Results & Findings

| Model (pre‑trained) | Recall@1 | Recall@5 | MRR |
| --- | --- | --- | --- |
| CodeBERT | 12.4 % | 31.8 % | 0.22 |
| GraphCodeBERT | 14.1 % | 34.5 % | 0.25 |
| StarCoder (7B) | 18.9 % | 41.2 % | 0.31 |
| Fine‑tuned model (ArkTS+TS) | 27.6 % | 55.3 % | 0.44 |
  • Existing multilingual code models struggle with ArkTS, likely because they have seen little ArkTS during pre‑training.
  • Adding TypeScript data (a syntactically close language) helps, but the biggest boost comes from fine‑tuning on the native ArkTS comment‑function pairs.
  • Error analysis shows most failures occur on highly generic comments (“initializes component”) or on functions that heavily rely on OpenHarmony‑specific APIs not covered in the training set.

Practical Implications

  • IDE assistance – The fine‑tuned embedding model can power “search‑by‑comment” features in OpenHarmony IDEs, letting developers locate existing implementations without remembering exact function names.
  • Automated documentation – By matching undocumented functions to the nearest comment in the embedding space, teams can auto‑generate or suggest doc‑strings for legacy code.
  • Bug triage & code review – Retrieval models can surface similar functions when a developer flags a suspicious snippet, aiding quick pattern‑based debugging.
  • Cross‑language tooling – Since the model also benefits from TypeScript data, it can serve as a bridge for developers migrating code between TypeScript and ArkTS, suggesting idiomatic equivalents.
  • Research acceleration – With a public benchmark, the community can now benchmark new code‑understanding models (e.g., LLMs, graph‑based encoders) on a real‑world, industry‑relevant language.
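
At its core, the "search‑by‑comment" feature above is a nearest‑neighbor lookup in the shared embedding space. A minimal sketch, assuming the query and the candidate functions have already been embedded (e.g., with the authors' fine‑tuned model):

```python
import numpy as np

def search_by_comment(query_emb, function_embs, top_k=5):
    """Return (index, cosine score) pairs for the top_k candidate functions
    most similar to the embedded natural-language query."""
    q = query_emb / np.linalg.norm(query_emb)
    f = function_embs / np.linalg.norm(function_embs, axis=1, keepdims=True)
    scores = f @ q                       # cosine similarity per candidate
    order = np.argsort(-scores)[:top_k]  # highest similarity first
    return [(int(i), float(scores[i])) for i in order]
```

This exhaustive scan is fine at the benchmark's scale; at millions of functions, an approximate nearest‑neighbor index (e.g., FAISS) would typically replace it.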

Limitations & Future Work

  • Dataset bias – The corpus is limited to open‑source repositories, which may over‑represent utility or demo code and under‑represent proprietary, performance‑critical modules.
  • Comment quality – Not all doc‑comments are well‑written; noisy or missing comments can affect both training and evaluation.
  • Task scope – The benchmark covers only the single‑search retrieval task; other useful tasks (e.g., code generation, bug detection) remain unexplored.
  • Scalability – Retrieval experiments were conducted on a few hundred thousand functions; scaling to millions of functions (as in large enterprise codebases) may require additional indexing optimizations.

Future work could expand the dataset with more diverse repositories, incorporate multi‑modal signals (e.g., UI screenshots), and evaluate the model in downstream developer tools to measure real‑world productivity gains.

Authors

  • Yulong He
  • Artem Ermakov
  • Sergey Kovalchuk
  • Artem Aliev
  • Dmitry Shalymov

Paper Information

  • arXiv ID: 2602.05550v1
  • Categories: cs.SE, cs.CL
  • Published: February 5, 2026