[Paper] Analyzing developer discussions on EU and US privacy legislation compliance in GitHub repositories

Published: December 11, 2025 at 08:16 AM EST

Source: arXiv - 2512.10618v1

Overview

This study mines real‑world discussions among open‑source developers on GitHub to see how they grapple with the EU’s GDPR and California’s CCPA, the US law it examines. From ≈ 33 k issue threads, the authors surface the concrete problems developers face when trying to make their code legally compliant, giving teams a practical picture of where compliance work concentrates.

Key Contributions

  • Large‑scale empirical dataset – 32,820 GitHub issues related to GDPR/CCPA compliance collected across diverse repositories.
  • Taxonomy of privacy‑law discussions – 24 fine‑grained categories grouped into six high‑level clusters (features/bugs, consent, documentation, data storing/sharing, adaptability, general compliance).
  • Quantitative focus on user rights – Shows developers concentrate on the rights to erasure, opt‑out, and access, while other rights (e.g., data portability and rights concerning profiling) receive far less attention.
  • Mixed‑method analysis – Automatic tagging of law‑related concepts combined with manual coding of a 1,186‑issue sample to validate and enrich the taxonomy.
  • Actionable recommendations – Provides a checklist for practitioners, curriculum suggestions for educators, and research gaps for tool builders.

Methodology

  1. Data collection – The authors queried the GitHub REST API for issues containing keywords tied to GDPR and CCPA (e.g., “GDPR”, “privacy”, “data deletion”). After filtering out noise (spam, non‑English, duplicates), they retained 32,820 issue threads.
  2. Automatic labeling – Using a curated list of legal terms (user‑rights, principles, and obligations), a lightweight NLP pipeline flagged which issues mentioned specific GDPR/CCPA concepts.
  3. Manual sampling – Two researchers hand‑coded a stratified random sample of 1,186 issues drawn from the automatically labeled pool. They assigned each issue to one of 24 discussion categories, iteratively refining the scheme until inter‑rater agreement (Cohen’s κ) exceeded 0.8.
  4. Clustering – The 24 categories were grouped into six logical clusters based on thematic similarity (e.g., all consent‑related categories fell under the “Consent” cluster).
  5. Quantitative analysis – Frequency counts and cross‑tabulations revealed which legal rights and technical concerns dominate the conversation.
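
The agreement check in step 3 can be made concrete with a short sketch. The function below computes Cohen’s κ for two coders from observed and chance agreement; the ten labels and the category names are invented for illustration and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, derived from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical codes for ten issues (category names are illustrative only).
coder_1 = ["consent", "consent", "erasure", "docs", "consent",
           "erasure", "docs", "consent", "erasure", "docs"]
coder_2 = ["consent", "consent", "erasure", "docs", "erasure",
           "erasure", "docs", "consent", "erasure", "consent"]

kappa = cohens_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.3f}")  # the coders agree on 8 of 10 items
```

In this toy case the coders agree on 80 % of the items, yet κ comes out near 0.70 because some of that agreement is expected by chance. Under a 0.8 threshold like the one reported, such a result would call for another round of refining the coding scheme.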

Results & Findings

  • Dominant topics: “User consent” (≈ 28 % of issues) and “bugs/feature requests related to privacy” (≈ 22 %) are the top discussion clusters.
  • User‑right focus: The right to erasure (deletion), the right to opt‑out, and the right to access are mentioned in > 60 % of the privacy‑related issues; rights concerning data portability, profiling, or the “right to be informed” appear in < 10 % of cases.
  • Technical pain points: Cookie management, logging, and data‑store configuration are the most frequent implementation challenges.
  • Documentation gaps: Developers often raise questions about how to document consent flows or privacy notices, indicating a lack of clear guidance in existing project READMEs or wikis.
  • Adaptability concerns: A smaller but notable slice of issues (≈ 7 %) discuss how to make systems flexible enough to accommodate future law changes or jurisdiction‑specific requirements.

Practical Implications

  • Prioritize the “big three” rights – Teams can fast‑track compliance by first implementing reliable deletion, opt‑out, and access mechanisms; the quantitative analysis shows these are the rights developers discuss most.
  • Add consent scaffolding early – Since consent is the most‑discussed cluster, integrating a consent‑management platform (e.g., Cookiebot, OneTrust) or building a reusable consent module can reduce downstream issue volume.
  • Upgrade documentation practices – Embedding privacy impact statements and consent flow diagrams directly in repository wikis can pre‑empt many “how‑do‑I‑document‑this?” tickets.
  • Automated linting & CI checks – The taxonomy can seed rule sets for static analysis tools (e.g., detecting missing deletion endpoints or insecure cookie flags) that automatically flag compliance gaps during pull‑request reviews.
  • Curriculum design – Educators can use the six clusters as a syllabus backbone, ensuring students get hands‑on experience with consent UI, data‑store sanitization, and legal‑requirement documentation.
  • Tooling opportunities – The identified gaps (e.g., data‑portability support) point to a market for open‑source SDKs that abstract away the boilerplate of GDPR/CCPA compliance.
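
As a rough illustration of the linting idea above, the sketch below checks a route list for erasure, access, and opt‑out endpoints and checks a Set‑Cookie header value for missing Secure, HttpOnly, and SameSite attributes. The endpoint paths, the REQUIRED_RIGHTS table, and both function names are hypothetical; the paper does not prescribe any of them.

```python
# Hypothetical mapping from a user right to the endpoint a project is
# expected to expose; real projects will use their own paths.
REQUIRED_RIGHTS = {
    "erasure": ("DELETE", "/users/{id}/data"),
    "access": ("GET", "/users/{id}/data"),
    "opt_out": ("POST", "/users/{id}/opt-out"),
}


def missing_rights_endpoints(routes):
    """Return the rights whose (method, path) pair is absent from `routes`."""
    declared = {(method.upper(), path) for method, path in routes}
    return sorted(
        right for right, endpoint in REQUIRED_RIGHTS.items()
        if endpoint not in declared
    )


def insecure_cookie_flags(set_cookie_value):
    """Return which of Secure/HttpOnly/SameSite a Set-Cookie value lacks."""
    # Attributes follow the first `name=value` pair, separated by semicolons.
    attributes = [part.strip() for part in set_cookie_value.split(";")[1:]]
    present = {attr.split("=", 1)[0].strip().lower() for attr in attributes if attr}
    return [flag for flag in ("secure", "httponly", "samesite") if flag not in present]


# Example inputs, as a CI job might collect them from a route table and a response.
routes = [("get", "/users/{id}/data"), ("POST", "/users/{id}/opt-out")]
gaps = missing_rights_endpoints(routes)
cookie_gaps = insecure_cookie_flags("session=abc123; Path=/; HttpOnly")

print(gaps)         # rights with no matching endpoint
print(cookie_gaps)  # cookie attributes that were not set
```

A pull‑request check could fail the build whenever either list is non‑empty. Matching on exact paths is a deliberate simplification; a real rule set would need to follow the project’s own routing conventions.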

Limitations & Future Work

  • Language & platform bias – The dataset is limited to English‑language issues on public GitHub repos, potentially overlooking private or non‑English projects where compliance challenges differ.
  • Static snapshot – Issues were collected at a single point in time; the evolving nature of both legislation and tooling means the taxonomy may need periodic updates.
  • Depth of legal nuance – Automated tagging relies on keyword matching, which can miss subtle legal interpretations or context‑specific obligations.
  • Future directions – The authors suggest expanding the analysis to pull‑request discussions, issue comments, and other collaboration artifacts, as well as building a public benchmark dataset for training more sophisticated NLP models that can capture nuanced legal references.
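
The keyword‑matching limitation can be seen in a minimal sketch of a term‑list tagger. The LEGAL_TERMS patterns and the two issue titles are assumptions made up for this example (the paper’s curated term list is not reproduced here): an explicit mention is tagged, while a paraphrase of the same request slips through.

```python
import re

# Hypothetical term list; each concept maps to a few literal phrasings.
LEGAL_TERMS = {
    "right_to_erasure": [r"right to erasure", r"right to be forgotten"],
    "right_to_access": [r"right to access", r"data access request"],
    "opt_out": [r"opt[- ]out", r"do not sell"],
}

# One case-insensitive pattern per concept, built from its phrasings.
COMPILED = {
    concept: re.compile("|".join(patterns), re.IGNORECASE)
    for concept, patterns in LEGAL_TERMS.items()
}


def tag_issue(text):
    """Return the concepts whose term patterns occur in the issue text."""
    return sorted(c for c, pattern in COMPILED.items() if pattern.search(text))


explicit = "GDPR: support the right to be forgotten and an opt-out toggle"
paraphrase = "Users should be able to wipe their account history on request"

explicit_tags = tag_issue(explicit)
paraphrase_tags = tag_issue(paraphrase)

print(explicit_tags)    # concepts found through literal phrases
print(paraphrase_tags)  # empty: the erasure request is worded differently
```

Extending the term list narrows the gap but cannot close it, which is one reason the authors point toward more sophisticated NLP models.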

Bottom line: By turning thousands of GitHub issue threads into a structured privacy‑law taxonomy, this work gives developers a practical map of where the compliance pain points lie and how to address them efficiently. Whether you’re building a new open‑source library or retrofitting an existing product, the six clusters and 24 categories can serve as a checklist for GDPR/CCPA‑ready development.

Authors

  • Georgia M. Kapitsaki
  • Maria Papoutsoglou
  • Christoph Treude
  • Ioanna Theophilou

Paper Information

  • arXiv ID: 2512.10618v1
  • Categories: cs.SE
  • Published: December 11, 2025