[Paper] Permissive-Washing in the Open AI Supply Chain: A Large-Scale Audit of License Integrity

Published: February 9, 2026 at 10:51 AM EST
4 min read
Source: arXiv
Overview

The paper Permissive‑Washing in the Open AI Supply Chain exposes a hidden legal risk in today’s booming AI ecosystem: most open‑source datasets, models, and applications that claim “MIT‑style” permissive licenses are missing the very license files and attribution notices required to make that claim enforceable. By auditing over 124 k AI supply‑chain links on Hugging Face and GitHub, the authors show that the vast majority of artifacts are effectively unlicensed, putting downstream developers at risk of copyright infringement.

Key Contributions

  • Large‑scale empirical audit of 124,278 dataset → model → application chains (3,338 datasets, 6,664 models, 28,516 applications).
  • Quantitative evidence of “permissive washing”: > 95 % of datasets and models lack the mandatory license text; only a handful meet both license‑text and copyright‑notice requirements.
  • Propagation analysis: Demonstrates that even when upstream artifacts are properly licensed, downstream models and applications rarely preserve the required attribution (27.6 % for models, 5.8 % for applications).
  • Open research artefacts: Releases the full audit dataset and a reproducible pipeline so the community can continue monitoring license compliance.
  • Legal‑technical insight: Clarifies that metadata (e.g., tags on GitHub) is not a legal substitute for the actual license file and copyright notice.

Methodology

  1. Data collection – The authors crawled public repositories on Hugging Face and GitHub, extracting every declared link from a dataset to a model and from a model to an application.
  2. License extraction – For each artifact they searched the repository tree for a LICENSE file, a COPYRIGHT file, or inline license headers. They also parsed SPDX identifiers in package.json, setup.cfg, etc.
  3. Compliance checking – An artifact was deemed compliant only if (a) the full permissive‑license text was present, and (b) a copyright notice referencing the upstream author(s) was included.
  4. Propagation tracing – Using the harvested dependency graph, they verified whether downstream artifacts copied the required license text and attribution from their immediate upstream source.
  5. Statistical analysis – The team computed compliance rates per artifact type, examined distribution across license families (MIT, Apache‑2.0, BSD‑3), and performed correlation tests to see if factors like repository size or star count affect compliance.
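The per-artifact compliance check (steps 2–3) can be sketched in a few lines. This is a minimal illustration, not the authors' released pipeline: the file names, the MIT marker phrase, and the copyright regex are assumptions chosen to match the paper's two-part criterion (full license text present, plus a copyright notice).

```python
import re
from pathlib import Path

# Hypothetical file-name candidates; the paper's pipeline may search a wider set.
LICENSE_BASENAMES = {"LICENSE", "COPYING", "COPYRIGHT"}
# A distinctive phrase from the MIT license body: checks for the full text,
# not merely a declared tag such as "MIT" in metadata.
MIT_MARKER = "Permission is hereby granted, free of charge"
COPYRIGHT_RE = re.compile(r"Copyright\s+(\(c\)|\u00a9)?\s*\d{4}", re.IGNORECASE)

def audit_repo(repo_root: str) -> dict:
    """Return a minimal compliance report for one repository checkout."""
    root = Path(repo_root)
    has_text = False
    has_notice = False
    for path in root.rglob("*"):
        if path.is_file() and path.name.upper().split(".")[0] in LICENSE_BASENAMES:
            body = path.read_text(errors="ignore")
            has_text = has_text or MIT_MARKER in body
            has_notice = has_notice or bool(COPYRIGHT_RE.search(body))
    # Compliant only if BOTH requirements hold, mirroring the paper's criterion.
    return {"license_text": has_text, "copyright_notice": has_notice,
            "compliant": has_text and has_notice}
```

A repository that carries only an SPDX tag in its metadata would report `compliant: False` under this check, which is exactly the gap the audit quantifies.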

Results & Findings

| Artifact type | Full license text | Text + copyright notice | Upstream notice preserved |
|---|---|---|---|
| Datasets | 96.5 % missing | 2.3 % compliant | N/A |
| Models | 95.8 % missing | 3.2 % compliant | 27.6 % preserve dataset notice |
| Applications | — (license usually on model) | — | 5.8 % preserve model notice (6.4 % any upstream notice) |

  • License text omission is the norm, not the exception.
  • Attribution decay: Even when a model correctly includes a dataset’s license, the downstream application almost never carries that attribution forward.
  • Metadata illusion: Many repositories list a permissive SPDX identifier in README or pyproject.toml, but without the actual license file this does not satisfy legal requirements.

Practical Implications

  • Developers can’t rely on tags alone – Before reusing a dataset or model, verify the presence of a LICENSE file and a proper copyright line.
  • CI/CD checks – Integrate automated license‑file detection (e.g., using the authors’ pipeline) into build pipelines to flag missing documentation early.
  • Corporate risk management – Legal teams should treat “permissively‑licensed” AI assets as potentially unlicensed until the required files are confirmed, adjusting due‑diligence checklists accordingly.
  • Open‑source maintainers – Adding a clear LICENSE file and explicit attribution in the repository root can dramatically improve downstream compliance and protect the community from litigation.
  • Tooling opportunities – There is a market for plugins (for GitHub Actions, Hugging Face Spaces, etc.) that automatically copy upstream license notices when publishing derived models or applications.
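The CI/CD recommendation above can be as small as a single gate script that fails the build when no license file is present. This is a generic sketch (not the authors' tooling); the script name and exit-code convention are assumptions, suitable for wiring into a GitHub Actions step or any other pipeline.

```python
import sys
from pathlib import Path

# Hypothetical candidate names; extend to match your organization's policy.
REQUIRED = ("LICENSE", "LICENSE.txt", "LICENSE.md", "COPYING")

def ci_license_gate(repo_root: str = ".") -> int:
    """Return 0 if a license file exists at the repo root, 1 otherwise.
    Intended as an early-exit step in a CI pipeline."""
    root = Path(repo_root)
    if any((root / name).is_file() for name in REQUIRED):
        return 0
    print("ERROR: no LICENSE file found; artifact may be effectively unlicensed.",
          file=sys.stderr)
    return 1

if __name__ == "__main__":
    sys.exit(ci_license_gate())
```

Run as `python ci_license_gate.py` in the checkout directory; a nonzero exit fails the job before an unlicensed artifact is published.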

Limitations & Future Work

  • Scope limited to public repos on Hugging Face and GitHub; private or enterprise‑hosted AI assets may exhibit different compliance patterns.
  • License families examined are primarily MIT, Apache‑2.0, BSD‑3; other permissive or copyleft licenses were not the focus.
  • Static analysis only – The study does not assess whether missing license files are intentional (e.g., proprietary intent) or accidental.
  • Future directions suggested include extending the audit to other platforms (GitLab, Bitbucket), studying the impact of licensing tools (e.g., REUSE), and measuring how compliance evolves after the release of the authors’ dataset and pipeline.

Authors

  • James Jewitt
  • Gopi Krishnan Rajbahadur
  • Hao Li
  • Bram Adams
  • Ahmed E. Hassan

Paper Information

  • arXiv ID: 2602.08816v1
  • Categories: cs.LG, cs.AI, cs.CY, cs.SE
  • Published: February 9, 2026