[Paper] 'Write in English, Nobody Understands Your Language Here': A Study of Non-English Trends in Open-Source Repositories

Published: February 22, 2026 at 09:31 PM EST
Source: arXiv - 2602.19446v1

Overview

The paper investigates how open‑source software (OSS) is shifting from an English‑centric ecosystem toward a multilingual one. By mining billions of GitHub interactions and tens of thousands of repositories, the authors show that non‑English communication, especially in Korean, Chinese, and Russian, is growing and is reshaping collaboration dynamics and project visibility.

Key Contributions

  • Large‑scale multilingual analysis – Processed 9.14 B GitHub issues, PRs, and discussions plus 62.5 K repositories across 5 programming languages and 30 natural languages (2015‑2025).
  • Comprehensive language‑usage taxonomy – Tracked English vs. non‑English content in three OSS artefacts: (1) communication (issues/PR comments), (2) code (comments & string literals), and (3) documentation (README, Wiki, etc.).
  • Empirical trends – Demonstrated steady growth of non‑English participation, with Korean, Chinese, and Russian showing the strongest upward trajectories.
  • Visibility & participation gap – Showed that projects with predominantly non‑English content receive fewer stars, forks, and external contributors than comparable English‑dominant projects.
  • “Language tension” framework – Introduced a sociotechnical lens describing how native‑language expression can clash with community norms that privilege English, affecting onboarding and conflict resolution.

Methodology

  1. Data collection – Leveraged the GitHub Archive and the GHTorrent dataset to extract every public issue, pull request, and discussion comment posted between 2015 and 2025.
  2. Language detection – Applied a hybrid pipeline (fastText language ID + custom Unicode script heuristics) to label each textual snippet with one of 30 target languages.
  3. Repository sampling – Selected 62.5 K repositories written in Java, Python, JavaScript, C++, and Go, ensuring a balanced mix of project sizes and activity levels.
  4. Artefact extraction – Parsed source trees to collect code comments, string literals, and documentation files (README, CONTRIBUTING, Wiki pages).
  5. Metric construction – Computed language‑share ratios, growth rates, and visibility indicators (stars, forks, external contributors).
  6. Statistical analysis – Used mixed‑effects regression to isolate language trends while controlling for confounders such as project age and popularity.

The pipeline is deliberately modular, allowing other researchers or tooling teams to plug in additional languages or artefact types without re‑building the whole stack.
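
To make steps 2 and 5 more concrete, here is a minimal sketch of the Unicode script heuristic and the language‑share ratio. It is not the authors' code: the fastText stage is left out entirely, and the names `SCRIPT_RANGES`, `dominant_script`, `script_shares`, and the `min_letters` threshold are assumptions made for illustration. A script is also not a language (Han characters appear in both Chinese and Japanese, and Cyrillic covers Russian among many others), which is presumably why the paper pairs such heuristics with a statistical language identifier.

```python
from collections import Counter

# Hypothetical, deliberately incomplete script table (inclusive code point ranges).
SCRIPT_RANGES = {
    "hangul": [(0x1100, 0x11FF), (0xAC00, 0xD7A3)],   # Jamo + precomposed syllables
    "han": [(0x4E00, 0x9FFF)],                        # CJK Unified Ideographs
    "cyrillic": [(0x0400, 0x04FF)],
    "latin": [(0x0041, 0x005A), (0x0061, 0x007A)],    # basic A-Z, a-z only
}


def char_script(ch):
    """Return the script name for one character, or None if it is not in the table."""
    cp = ord(ch)
    for script, ranges in SCRIPT_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return script
    return None


def dominant_script(text, min_letters=3):
    """Label a snippet by its most frequent script.

    Snippets with fewer than `min_letters` recognised characters are labelled
    "undetermined" (an assumed threshold) instead of being guessed.
    """
    counts = Counter(s for s in map(char_script, text) if s is not None)
    if sum(counts.values()) < min_letters:
        return "undetermined"
    return counts.most_common(1)[0][0]


def script_shares(snippets):
    """Share of snippets per label: a simple language-share ratio."""
    labels = Counter(dominant_script(s) for s in snippets)
    total = sum(labels.values())
    return {label: n / total for label, n in labels.items()}


if __name__ == "__main__":
    comments = ["Fix bug", "修复了错误", "Add tests", "ok"]
    print(script_shares(comments))
```

Returning "undetermined" for very short snippets keeps one‑word comments from being counted toward any language, at the cost of leaving some snippets unlabelled.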

Results & Findings

| Aspect | Key Finding | Interpretation |
| --- | --- | --- |
| Communication | Non‑English comments grew from 3 % (2015) to 12 % (2025) of all issue/PR discussions. | OSS conversations are becoming more linguistically diverse. |
| Code comments & strings | Chinese and Korean comment density increased by ≈ 150 % in the last five years. | Developers embed native‑language explanations directly in code, improving local understandability but reducing cross‑border readability. |
| Documentation | Multilingual READMEs rose from 1.8 % to 9.4 % of total docs. | Projects are beginning to cater to non‑English audiences, yet many still provide only an English version. |
| Visibility gap | Projects with > 70 % non‑English content receive ≈ 40 % fewer stars and 30 % fewer external contributors than English‑dominant peers of similar size. | Language acts as a barrier to discovery and collaboration. |
| Language tension | Surveyed contributors reported “confusion” or “friction” when mixing English and native language in the same thread (≈ 22 % of respondents). | Community norms still favor English, leading to potential exclusion or conflict. |

Overall, the data confirm a steady multilingual shift but also highlight that English retains a strong gate‑keeping role in OSS visibility and participation.
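
The communication figures (3 % in 2015, 12 % in 2025) can be read three ways: as a change in percentage points, as a relative multiple, or as an annualised rate. The sketch below works through that arithmetic. The annualised figure assumes smooth compound growth across the decade, which is a simplification made here and not something the summary attributes to the paper; the function names are illustrative.

```python
def percentage_point_change(start_pct, end_pct):
    """Absolute change in share, in percentage points."""
    return end_pct - start_pct


def relative_growth(start_pct, end_pct):
    """How many times larger the final share is than the initial one."""
    return end_pct / start_pct


def compound_annual_growth_rate(start_pct, end_pct, years):
    """Constant yearly growth rate that would turn start_pct into end_pct."""
    return (end_pct / start_pct) ** (1 / years) - 1


if __name__ == "__main__":
    start, end, years = 3.0, 12.0, 2025 - 2015
    print(f"change: {percentage_point_change(start, end):.1f} percentage points")
    print(f"multiple: {relative_growth(start, end):.1f}x")
    print(f"annualised: {compound_annual_growth_rate(start, end, years):.1%}")
```

Under that assumption, the share rose by 9 percentage points, a fourfold increase, which corresponds to roughly 15 % growth per year.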

Practical Implications

  1. Tooling for multilingual collaboration

    • IDE plugins and code‑review bots can auto‑detect non‑English comments and suggest inline translations or language tags, reducing comprehension gaps.
    • Projects that opt in could have their CI pipelines check a multilingual documentation policy (e.g., require an English README.en.md alongside a native‑language README.zh.md).
  2. Community governance

    • Project maintainers might adopt clear language‑use guidelines (e.g., “English for all public discussion; native language allowed in comments with translations”).
    • Labels or bots that flag language‑mixing can help moderators mediate “language tension” before it escalates.
  3. Search & discovery

    • Search engines and GitHub’s recommendation algorithms could incorporate language metadata to surface non‑English projects to speakers of those languages, improving visibility.
  4. Onboarding & mentorship

    • Organizations hiring globally can leverage the findings to design onboarding materials in multiple languages, lowering the barrier for new contributors from non‑English backgrounds.
  5. Internationalization (i18n) best practices

    • The study underscores the need to treat code comments and string literals as first‑class i18n artifacts, not just UI text.
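
The CI documentation check mentioned under item 1 could look roughly like the following. This is a sketch under stated assumptions, not a tool from the paper: it assumes READMEs follow a `README.<lang>.md` naming convention, treats an unsuffixed `README.md` as being of unknown language, and uses the hypothetical helper names `readme_languages`, `english_readme_missing`, and `check_repo`.

```python
import re
from pathlib import Path

# Assumed naming convention: README.md or README.<lang>.md (e.g. README.zh.md,
# README.zh-CN.md). Any short alphabetic suffix is treated as a language tag,
# so a file such as README.old.md would be misread as a translation.
README_PATTERN = re.compile(
    r"^README(?:\.([A-Za-z]{2,3}(?:[-_][A-Za-z]{2,4})?))?\.md$", re.IGNORECASE
)


def readme_languages(filenames):
    """Collect language tags found in README file names ("default" if unsuffixed)."""
    langs = set()
    for name in filenames:
        match = README_PATTERN.match(name)
        if match:
            langs.add((match.group(1) or "default").lower())
    return langs


def english_readme_missing(filenames):
    """True when a translated README exists but README.en.md does not."""
    langs = readme_languages(filenames)
    translated = langs - {"default", "en"}
    return bool(translated) and "en" not in langs


def check_repo(root="."):
    """Return a CI-style exit code: 1 if the policy is violated, else 0."""
    names = [p.name for p in Path(root).iterdir() if p.is_file()]
    if english_readme_missing(names):
        print("Policy check failed: add README.en.md next to the translated README.")
        return 1
    return 0
```

A CI job could call `sys.exit(check_repo("."))` from the repository root. Because the policy is meant to be opt‑in, a project would add this step only after agreeing on the convention.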

Limitations & Future Work

  • Language detection noise – Short snippets (e.g., single‑word comments) sometimes yield ambiguous IDs, potentially inflating or deflating language counts.
  • Platform bias – The analysis is limited to public GitHub data; private repositories or other platforms (GitLab, Bitbucket) may exhibit different patterns.
  • Causality vs. correlation – While a visibility gap is observed, the study cannot definitively prove that language alone causes lower star/fork counts; other factors (project marketing, network effects) may also contribute.
  • Future directions – Extending the study to include runtime localization files, issue labeling practices, and cross‑project language migration; building real‑time multilingual dashboards for maintainers; and conducting controlled experiments on how translation bots affect contributor retention.

Authors

  • Masudul Hasan Masud Bhuiyan
  • Manish Kumar Bala Kumar
  • Cristian-Alexandru Staicu

Paper Information

  • arXiv ID: 2602.19446v1
  • Categories: cs.SE, cs.CY
  • Published: February 23, 2026