[Paper] 'Write in English, Nobody Understands Your Language Here': A Study of Non-English Trends in Open-Source Repositories

Published: February 22, 2026 at 09:31 PM EST

Source: arXiv - 2602.19446v1

Overview

The paper investigates how open-source software (OSS) is shifting from an English-centric ecosystem toward a multilingual one. By mining 9.14 billion GitHub interactions and 62.5 K repositories, the authors show that non-English communication, especially in Korean, Chinese, and Russian, is on the rise and is reshaping collaboration dynamics and project visibility.

Key Contributions

Large‑scale multilingual analysis – Processed 9.14 B GitHub issues, PRs, and discussions plus 62.5 K repositories across 5 programming languages and 30 natural languages (2015‑2025).
Comprehensive language‑usage taxonomy – Tracked English vs. non‑English content in three OSS artefacts: (1) communication (issues/PR comments), (2) code (comments & string literals), and (3) documentation (README, Wiki, etc.).
Empirical trends – Demonstrated steady growth of non‑English participation, with Korean, Chinese, and Russian showing the strongest upward trajectories.
Visibility & participation gap – Showed that projects with predominantly non‑English content receive fewer stars, forks, and external contributors than comparable English‑dominant projects.
“Language tension” framework – Introduced a sociotechnical lens describing how native‑language expression can clash with community norms that privilege English, affecting onboarding and conflict resolution.

Methodology

Data collection – Leveraged the GitHub Archive and the GHTorrent dataset to extract every public issue, pull request, and discussion comment posted between 2015 and 2025.
Language detection – Applied a hybrid pipeline (fastText language ID + custom Unicode script heuristics) to label each textual snippet with one of 30 target languages.
Repository sampling – Selected 62.5 K repositories written in Java, Python, JavaScript, C++, and Go, ensuring a balanced mix of project sizes and activity levels.
Artefact extraction – Parsed source trees to collect code comments, string literals, and documentation files (README, CONTRIBUTING, Wiki pages).
Metric construction – Computed language‑share ratios, growth rates, and visibility indicators (stars, forks, external contributors).
Statistical analysis – Used mixed‑effects regression to isolate language trends while controlling for confounders such as project age and popularity.

The pipeline is deliberately modular, allowing other researchers or tooling teams to plug in additional languages or artefact types without re‑building the whole stack.
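To make the language-detection and metric-construction steps more concrete, here is a minimal sketch of the Unicode-script half of such a pipeline, followed by a share-ratio computation. It is not the authors' code: the `SCRIPT_PREFIXES` table and the function names `dominant_script` and `script_shares` are hypothetical, and the fastText stage is left out entirely. A script label is also only a coarse proxy for a language (Han characters appear in both Chinese and Japanese text, and Cyrillic covers more than Russian), which is presumably why the paper pairs script heuristics with a statistical language identifier.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from the first word of a Unicode character name
# to a coarse script label. Not taken from the paper.
SCRIPT_PREFIXES = {
    "HANGUL": "Hangul",
    "CJK": "Han",
    "HIRAGANA": "Kana",
    "KATAKANA": "Kana",
    "CYRILLIC": "Cyrillic",
    "LATIN": "Latin",
}


def dominant_script(text: str) -> str:
    """Return the most frequent script among the letters of `text`."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, whitespace, symbols
        name = unicodedata.name(ch, "")  # "" if the character has no name
        prefix = name.split(" ", 1)[0]
        counts[SCRIPT_PREFIXES.get(prefix, "Other")] += 1
    if not counts:
        return "Unknown"
    return counts.most_common(1)[0][0]


def script_shares(snippets: list[str]) -> dict[str, float]:
    """Fraction of snippets whose dominant script is each label."""
    labels = Counter(dominant_script(s) for s in snippets)
    total = sum(labels.values())
    return {label: n / total for label, n in labels.items()} if total else {}


comments = [
    "Fix null pointer in parser",
    "버그를 수정했습니다",
    "修复了解析器中的错误",
    "Исправлена ошибка",
]
shares = script_shares(comments)
```

With these four sample comments, each of the Latin, Hangul, Han, and Cyrillic labels receives a share of 0.25. Very short snippets remain unreliable under this kind of heuristic, which matches the detection-noise caveat raised in the limitations.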

Results & Findings

Aspect	Key Finding	Interpretation
Communication	Non‑English comments grew from 3 % (2015) to 12 % (2025) of all issue/PR discussions.	OSS conversations are becoming more linguistically diverse.
Code comments & strings	Chinese and Korean comment density increased by ≈ 150 % in the last five years.	Developers embed native‑language explanations directly in code, improving local understandability but reducing cross‑border readability.
Documentation	Multilingual READMEs rose from 1.8 % to 9.4 % of total docs.	Projects are beginning to cater to non‑English audiences, yet many still provide only an English version.
Visibility gap	Projects with > 70 % non‑English content receive ≈ 40 % fewer stars and 30 % fewer external contributors than English‑dominant peers of similar size.	Language acts as a barrier to discovery and collaboration.
Language tension	Surveyed contributors reported “confusion” or “friction” when mixing English and native language in the same thread (≈ 22 % of respondents).	Community norms still favor English, leading to potential exclusion or conflict.

Overall, the data confirm a steady multilingual shift but also highlight that English retains a strong gate‑keeping role in OSS visibility and participation.
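As a quick sanity check on the communication figure, the rise from 3 % to 12 % can be restated as an average yearly rate. The sketch below assumes smooth compounding between the two endpoints, which is an illustrative simplification and not something the summary attributes to the paper; the function name `compound_annual_growth` is hypothetical.

```python
def compound_annual_growth(start_share: float, end_share: float, years: int) -> float:
    """Average yearly growth rate that turns start_share into end_share."""
    if start_share <= 0 or years <= 0:
        raise ValueError("start_share and years must be positive")
    return (end_share / start_share) ** (1 / years) - 1


# Non-English share of issue/PR discussions: 3 % in 2015, 12 % in 2025.
rate = compound_annual_growth(0.03, 0.12, 2025 - 2015)
fold_change = 0.12 / 0.03
print(f"{fold_change:.0f}x overall, {rate:.1%} per year")  # prints "4x overall, 14.9% per year"
```

In other words, a fourfold increase over ten years corresponds to roughly 15 % relative growth per year, or 9 percentage points in absolute terms.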

Practical Implications

Tooling for multilingual collaboration
- IDE plugins and code‑review bots can auto‑detect non‑English comments and suggest inline translations or language tags, reducing comprehension gaps.
- CI pipelines could enforce optional multilingual documentation policies (e.g., require an English README.en.md alongside a native README.zh.md).
Community governance
- Project maintainers might adopt clear language‑use guidelines (e.g., “English for all public discussion; native language allowed in comments with translations”).
- Labels or bots that flag language‑mixing can help moderators mediate “language tension” before it escalates.
Search & discovery
- Search engines and GitHub’s recommendation algorithms could incorporate language metadata to surface non‑English projects to speakers of those languages, improving visibility.
Onboarding & mentorship
- Organizations hiring globally can leverage the findings to design onboarding materials in multiple languages, lowering the barrier for new contributors from non‑English backgrounds.
Internationalization (i18n) best practices
- The study underscores the need to treat code comments and string literals as first‑class i18n artifacts, not just UI text.
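The CI documentation policy mentioned under tooling could look roughly like the sketch below. It assumes a `README.<lang>.md` naming convention, generalized from the `README.en.md` / `README.zh.md` example above; the pattern and the function names `readme_languages`, `missing_english_readme`, and `check_repo` are hypothetical and do not come from the paper or from any existing tool.

```python
import re
from collections.abc import Iterable
from pathlib import Path

# Assumed convention: README.<lang>.md, e.g. README.en.md or README.zh-CN.md.
README_PATTERN = re.compile(r"^README\.([A-Za-z]{2,3}(?:-[A-Za-z0-9]+)?)\.md$")


def readme_languages(filenames: Iterable[str]) -> set[str]:
    """Collect the language tags of all language-suffixed README files."""
    langs = set()
    for name in filenames:
        match = README_PATTERN.match(name)
        if match:
            langs.add(match.group(1).lower())
    return langs


def missing_english_readme(filenames: Iterable[str]) -> bool:
    """True if localized READMEs exist but README.en.md is not among them."""
    langs = readme_languages(filenames)
    return bool(langs) and "en" not in langs


def check_repo(root: Path) -> int:
    """Return a CI-style exit code: 1 if the policy is violated, else 0."""
    names = [p.name for p in root.iterdir() if p.is_file()]
    if missing_english_readme(names):
        print("Localized README found without README.en.md")
        return 1
    return 0
```

A CI job could call `check_repo(Path("."))` and pass the result to `sys.exit`. A plain `README.md` is deliberately ignored here, so the check only fires once a project has opted in to language-suffixed READMEs, in keeping with the policy being optional.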

Limitations & Future Work

Language detection noise – Short snippets (e.g., single‑word comments) sometimes yield ambiguous IDs, potentially inflating or deflating language counts.
Platform bias – The analysis is limited to public GitHub data; private repositories or other platforms (GitLab, Bitbucket) may exhibit different patterns.
Causality vs. correlation – The study observes a visibility gap but cannot establish that language alone causes the lower star and fork counts; other factors, such as project marketing and network effects, may also play a role.
Future directions – Extending the study to include runtime localization files, issue labeling practices, and cross‑project language migration; building real‑time multilingual dashboards for maintainers; and conducting controlled experiments on how translation bots affect contributor retention.

Authors

Masudul Hasan Masud Bhuiyan
Manish Kumar Bala Kumar
Cristian-Alexandru Staicu

Paper Information

arXiv ID: 2602.19446v1
Categories: cs.SE, cs.CY
Published: February 23, 2026
