[Paper] AIBoMGen: Generating an AI Bill of Materials for Secure, Transparent, and Compliant Model Training
Source: arXiv - 2601.05703v1
Overview
The paper introduces AIBoMGen, a prototype platform that automatically creates a cryptographically‑signed AI Bill of Materials (AIBOM) for every model‑training run. By capturing datasets, model hyper‑parameters, code versions, and the exact compute environment, AIBoMGen gives developers a tamper‑evident record that can be used to prove compliance with emerging AI regulations such as the EU AI Act.
Key Contributions
- AIBOM Specification – Extends the well‑known Software Bill of Materials (SBOM) concept to cover AI‑specific artifacts (training data, model weights, preprocessing pipelines, hardware details); an illustrative manifest sketch follows this list.
- Automated Generation Pipeline – AIBoMGen hooks into the training workflow and produces a signed AIBOM without manual effort.
- Root‑of‑Trust Architecture – The training platform acts as a neutral third‑party observer, using cryptographic hashes, digital signatures, and in‑toto attestations to guarantee integrity.
- Tamper‑Detection Guarantees – Demonstrates that any post‑training modification of model files, data, or environment metadata is reliably detected.
- Negligible Overhead – Empirical evaluation shows < 2 % runtime impact, making the approach practical for large‑scale training pipelines.
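To make the AIBOM artifact concrete, here is a minimal sketch of the kind of manifest such a specification could describe. The field names and layout are illustrative assumptions, not the paper's actual schema.

```python
# Illustrative AIBOM manifest as a Python dict. Field names and values are
# hypothetical; the paper defines its own JSON schema.
aibom_manifest = {
    "model": {"name": "sentiment-classifier", "version": "1.0.0"},
    "datasets": [
        {"uri": "s3://corpus/train.parquet", "sha256": "ab12..."},  # placeholder digest
    ],
    "code": {"repo": "https://example.com/train.git", "commit": "3f2c9a1"},  # example values
    "hyperparameters": {"learning_rate": 3e-4, "epochs": 10, "batch_size": 64},
    "environment": {"os": "Ubuntu 22.04", "gpu": "NVIDIA A100", "cuda_driver": "535.104"},
    "artifacts": [
        {"path": "model.safetensors", "sha256": "cd34..."},  # placeholder digest
    ],
}
```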
Methodology
- Instrumentation Layer – A lightweight agent is attached to the training orchestrator (e.g., Kubernetes, Airflow). It records:
  - Input datasets (hashes, provenance URLs)
  - Code repository commits and dependency manifests
  - Hyper‑parameters, model architecture, and training scripts
  - Runtime environment (OS, driver versions, GPU/CPU specs)
- Artifact Hashing & Collection – Each captured artifact is hashed (SHA‑256) and stored in a temporary ledger.
- In‑toto Attestation – The collected hashes are wrapped in an in‑toto statement, which carries a cryptographic signature from the platform’s private key (the “root of trust”); a code sketch of the hashing and attestation steps follows this section.
- AIBOM Assembly – The attestation, together with a human‑readable JSON/YAML manifest, forms the final AIBOM.
- Verification API – Downstream consumers (model registries, auditors, CI pipelines) can fetch the AIBOM and verify signatures and hashes against the actual artifacts, ensuring nothing was altered after training.
The whole flow is triggered automatically for every training job, requiring no extra steps from data scientists.
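As a rough illustration of the hashing and attestation steps above, the sketch below digests a set of artifact files and wraps them in a simplified in‑toto‑style statement signed with an Ed25519 key. This is a minimal sketch using Python's hashlib and the cryptography package, assuming file‑based artifacts and a locally held platform key; AIBoMGen's actual statement layout, predicate type, and key management may differ.

```python
import hashlib
import json
from pathlib import Path

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey


def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_statement(artifacts: list[Path]) -> dict:
    """Assemble a simplified in-toto-style statement over the artifact digests."""
    return {
        "_type": "https://in-toto.io/Statement/v1",
        "subject": [
            {"name": p.name, "digest": {"sha256": sha256_file(p)}} for p in artifacts
        ],
        "predicateType": "https://example.com/aibom/v0",  # hypothetical predicate type
        "predicate": {"builder": "training-platform"},    # illustrative metadata only
    }


# The platform's signing key acts as the root of trust.
key = Ed25519PrivateKey.generate()  # in practice, loaded from secure storage
statement = build_statement([Path("model.safetensors"), Path("train.parquet")])
payload = json.dumps(statement, sort_keys=True).encode()
signature = key.sign(payload)  # `payload` + `signature` travel with the AIBOM
```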
Results & Findings
| Metric | Observation |
|---|---|
| Tamper detection | All simulated attacks (weight file replacement, dataset substitution, environment downgrade) were flagged by the verification step. |
| Performance overhead | Average added latency = 1.7 % of training time (≈ 1 minute for an hour‑long training job). |
| Signature verification time | Sub‑millisecond on a standard CPU, negligible for CI pipelines. |
| Scalability | Tested on 50 concurrent training jobs across 4 GPU nodes; AIBOM generation remained stable with linear resource usage. |
These results indicate that AIBoMGen can be deployed in production‑grade ML pipelines without sacrificing speed, while providing strong guarantees against artifact tampering.
Practical Implications
- Regulatory Compliance – Companies can produce auditable evidence that their models were trained on approved data and under controlled environments, easing EU AI Act reporting.
- Supply‑Chain Security – Just as SBOMs help secure software supply chains, AIBOMs expose hidden dependencies (e.g., third‑party datasets) that could be a source of bias or malicious data poisoning.
- Model Marketplace Trust – Vendors can attach a signed AIBOM to every model they sell, giving buyers confidence that the model hasn’t been altered post‑delivery.
- CI/CD Integration – The verification API can be plugged into existing MLOps pipelines (GitHub Actions, GitLab CI, Jenkins) to automatically reject builds that fail AIBOM checks; see the verification sketch after this list.
- Incident Response – In the event of a breach, the AIBOM provides a forensic snapshot of exactly what was used to create the compromised model, speeding root‑cause analysis.
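As one way to wire verification into CI, a pipeline step could recompute artifact digests, compare them against the attested values, check the platform's signature, and fail the build on any mismatch. This is a hedged sketch, not the paper's actual Verification API: the file names (`aibom.json`, `aibom.sig`, `platform.pub`) and statement layout are assumptions matching the generation sketch above.

```python
import hashlib
import json
import sys
from pathlib import Path

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey


def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_aibom(statement_path: Path, signature_path: Path, pubkey_bytes: bytes) -> None:
    payload = statement_path.read_bytes()

    # 1. Verify the platform signature over the statement; raises InvalidSignature on tampering.
    Ed25519PublicKey.from_public_bytes(pubkey_bytes).verify(
        signature_path.read_bytes(), payload
    )

    # 2. Recompute each artifact digest and compare to the attested value.
    statement = json.loads(payload)
    for subject in statement["subject"]:
        actual = sha256_file(Path(subject["name"]))
        if actual != subject["digest"]["sha256"]:
            raise ValueError(f"digest mismatch for {subject['name']}")


if __name__ == "__main__":
    try:
        verify_aibom(Path("aibom.json"), Path("aibom.sig"), Path("platform.pub").read_bytes())
    except (InvalidSignature, ValueError, FileNotFoundError) as err:
        print(f"AIBOM check failed: {err!r}")
        sys.exit(1)  # non-zero exit rejects the build in CI
    print("AIBOM verified")
```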
Limitations & Future Work
- Scope of Captured Artifacts – The current prototype focuses on static artifacts; dynamic runtime behaviors (e.g., on‑the‑fly data augmentation) are not fully captured.
- Key Management – The system assumes a secure, centrally‑managed signing key; a distributed key‑rotation strategy would be needed for large enterprises.
- Interoperability Standards – While the authors propose a JSON schema, broader industry adoption will require alignment with emerging standards bodies (e.g., SPDX, OpenChain).
- Extending to Inference – Future work could generate an AI Bill of Materials for Inference (AIBOM‑I) that records the model‑serving environment, request‑time preprocessing, and post‑processing steps.
Overall, AIBoMGen offers a concrete, low‑overhead path toward transparent and secure AI model lifecycles—an essential building block as AI moves from research labs into regulated production environments.
Authors
- Wiebe Vandendriessche
- Jordi Thijsman
- Laurens D’hooge
- Bruno Volckaert
- Merlijn Sebrechts
Paper Information
- arXiv ID: 2601.05703v1
- Categories: cs.SE, cs.AI, cs.CR
- Published: January 9, 2026