[Paper] ArkTS-CodeSearch: An Open-Source ArkTS Dataset for Code Retrieval

Published: February 5, 2026 at 06:15 AM EST
4 min read
Source: arXiv - 2602.05550v1

Overview

ArkTS is the primary language powering apps in the OpenHarmony ecosystem, but developers and researchers have struggled to build intelligent tools for it because no public code‑search datasets existed. This paper introduces ArkTS‑CodeSearch, the first large‑scale, open‑source collection of ArkTS functions paired with natural‑language comments, together with a benchmark that measures how well models can retrieve the right function from a textual query.

Key Contributions

  • First public ArkTS dataset: over 200,000 comment‑function pairs harvested from GitHub and Gitee, cleaned with a custom tree‑sitter‑arkts parser.
  • Systematic benchmark (single‑search task): Given a natural‑language comment, models must rank the correct ArkTS function among thousands of candidates.
  • Comprehensive evaluation of existing code‑embedding models on the new benchmark, exposing their strengths and gaps for ArkTS.
  • Fine‑tuned embedding model that combines ArkTS and TypeScript training data, achieving state‑of‑the‑art retrieval performance on ArkTS code.
  • Open‑source release of the dataset and the fine‑tuned model on Hugging Face, enabling reproducibility and downstream tool building.

Methodology

  1. Data collection – The authors crawled public ArkTS repositories from both GitHub and Gitee. Using the tree‑sitter‑arkts grammar, they extracted every function definition together with its preceding doc‑comment (the natural‑language description).
  2. Deduplication & cleaning – Cross‑platform duplicate detection removed identical functions that appeared in multiple forks or mirrors. Functions were then categorized (e.g., UI, system API, utility) to understand the corpus composition.
  3. Benchmark design – The single‑search task presents a comment and asks a model to retrieve the matching function from a large pool. Retrieval quality is measured with standard IR metrics such as Recall@k and MRR.
  4. Model evaluation & fine‑tuning – Off‑the‑shelf code‑embedding models (e.g., CodeBERT, GraphCodeBERT, StarCoder) were evaluated directly on the benchmark. Afterwards, the authors fine‑tuned a base embedding model on a mixed ArkTS + TypeScript training set, optimizing a contrastive loss that pushes matching comment‑function pairs together in the embedding space.
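
The retrieval metrics from step 3 are simple to compute once each query's gold function has a rank in the returned candidate list. A minimal sketch (the `ranks` values below are hypothetical, not results from the paper):

```python
def recall_at_k(ranks, k):
    """Fraction of queries whose gold function appears in the top-k results.

    `ranks` holds the 1-indexed rank of the correct function per query."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average reciprocal rank (1/rank) of the gold function across queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


ranks = [1, 3, 2, 10, 1]           # hypothetical gold ranks for five queries
print(recall_at_k(ranks, 1))       # 0.4
print(recall_at_k(ranks, 5))       # 0.8
print(round(mean_reciprocal_rank(ranks), 3))  # 0.587
```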

The pipeline is deliberately kept simple so that other researchers can replicate it for new languages or extend the dataset with additional repositories.
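
As a concrete illustration of the contrastive objective in step 4, here is a minimal NumPy sketch of a symmetric InfoNCE-style loss over a batch of matching comment/function embeddings. The paper does not publish its exact loss implementation; the temperature value and the symmetric formulation here are assumptions.

```python
import numpy as np

def info_nce_loss(comment_emb, code_emb, temperature=0.05):
    """Symmetric InfoNCE-style loss: row i of each matrix is a matching
    comment/function pair; the other rows in the batch act as negatives."""
    c = comment_emb / np.linalg.norm(comment_emb, axis=1, keepdims=True)
    f = code_emb / np.linalg.norm(code_emb, axis=1, keepdims=True)
    logits = c @ f.T / temperature             # pairwise cosine similarities
    idx = np.arange(len(logits))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)   # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()     # diagonal = matching pairs

    # Pull in both directions: comment -> function and function -> comment.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Minimizing this loss pushes each comment embedding toward its own function and away from the other functions in the batch, which is exactly the geometry the single-search benchmark rewards.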

Results & Findings

| Model (pre‑trained) | Recall@1 | Recall@5 | MRR |
| --- | --- | --- | --- |
| CodeBERT | 12.4 % | 31.8 % | 0.22 |
| GraphCodeBERT | 14.1 % | 34.5 % | 0.25 |
| StarCoder (7B) | 18.9 % | 41.2 % | 0.31 |
| Fine‑tuned model (ArkTS+TS) | 27.6 % | 55.3 % | 0.44 |
  • Existing multilingual code models struggle with ArkTS, likely because they have seen little ArkTS during pre‑training.
  • Adding TypeScript data (a syntactically close language) helps, but the biggest boost comes from fine‑tuning on the native ArkTS comment‑function pairs.
  • Error analysis shows most failures occur on highly generic comments (“initializes component”) or on functions that heavily rely on OpenHarmony‑specific APIs not covered in the training set.

Practical Implications

  • IDE assistance – The fine‑tuned embedding model can power “search‑by‑comment” features in OpenHarmony IDEs, letting developers locate existing implementations without remembering exact function names.
  • Automated documentation – By matching undocumented functions to the nearest comment in the embedding space, teams can auto‑generate or suggest doc‑strings for legacy code.
  • Bug triage & code review – Retrieval models can surface similar functions when a developer flags a suspicious snippet, aiding quick pattern‑based debugging.
  • Cross‑language tooling – Since the model also benefits from TypeScript data, it can serve as a bridge for developers migrating code between TypeScript and ArkTS, suggesting idiomatic equivalents.
  • Research acceleration – With a public benchmark, the community can now benchmark new code‑understanding models (e.g., LLMs, graph‑based encoders) on a real‑world, industry‑relevant language.
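
At its core, the "search‑by‑comment" feature above is a nearest‑neighbor lookup in the shared embedding space. A minimal sketch, assuming the query and the candidate functions have already been embedded (e.g., with the authors' fine‑tuned model):

```python
import numpy as np

def search_by_comment(query_emb, function_embs, top_k=5):
    """Return (index, cosine score) pairs for the top_k candidate functions
    most similar to the embedded natural-language query."""
    q = query_emb / np.linalg.norm(query_emb)
    f = function_embs / np.linalg.norm(function_embs, axis=1, keepdims=True)
    scores = f @ q                       # cosine similarity per candidate
    order = np.argsort(-scores)[:top_k]  # highest similarity first
    return [(int(i), float(scores[i])) for i in order]
```

This exhaustive scan is fine at the benchmark's scale; at millions of functions, an approximate nearest‑neighbor index (e.g., FAISS) would typically replace it.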

Limitations & Future Work

  • Dataset bias – The corpus is limited to open‑source repositories, which may over‑represent utility or demo code and under‑represent proprietary, performance‑critical modules.
  • Comment quality – Not all doc‑comments are well‑written; noisy or missing comments can affect both training and evaluation.
  • Task scope – The benchmark covers only the single‑search retrieval task; other useful tasks (e.g., code generation, bug detection) remain unexplored.
  • Scalability – Retrieval experiments were conducted on a few hundred thousand functions; scaling to millions of functions (as in large enterprise codebases) may require additional indexing optimizations.

Future work could expand the dataset with more diverse repositories, incorporate multi‑modal signals (e.g., UI screenshots), and evaluate the model in downstream developer tools to measure real‑world productivity gains.

Authors

  • Yulong He
  • Artem Ermakov
  • Sergey Kovalchuk
  • Artem Aliev
  • Dmitry Shalymov

Paper Information

  • arXiv ID: 2602.05550v1
  • Categories: cs.SE, cs.CL
  • Published: February 5, 2026