[Paper] Covenant-72B: Pre-Training a 72B LLM with Trustless Peers Over-the-Internet
Source: arXiv - 2603.08163v1
Overview
Covenant‑72B presents the first truly permissionless effort to pre‑train a 72‑billion‑parameter language model using compute contributed by anyone on the internet. By pairing a blockchain‑based coordination layer with a communication‑efficient optimizer, the authors demonstrate that large‑scale foundation models can be built without a closed, whitelisted cluster of machines, opening the door to a more democratic, cost‑effective path for AI research.
Key Contributions
- Largest open‑participation pre‑training run to date (≈ 72 B parameters, ~1.1 T tokens).
- Trustless coordination via blockchain, enabling anyone to join or leave the training pool without a central authority.
- Introduction of SparseLoCo, a sparse, communication‑efficient optimizer that tolerates highly dynamic peer membership.
- Empirical evidence that a globally distributed, permissionless setup can match or exceed the performance of centrally‑trained models with comparable compute budgets.
- Release of the Covenant‑72B model weights and training scripts, encouraging reproducibility and further community‑driven research.
Methodology
- Peer‑to‑peer network – Participants run a lightweight client that registers on a public blockchain. The chain records proofs of contributed work and enforces a simple “stake‑and‑verify” protocol to deter malicious updates.
- SparseLoCo optimizer – Extends classic LoCo (Local Communication) by sparsifying gradient exchanges. Only a small, dynamically selected subset of model shards is communicated each round, drastically cutting bandwidth while preserving convergence.
- Dynamic participation – Nodes can appear or disappear at any time. SparseLoCo re‑balances shard assignments on‑the‑fly, ensuring that the global model sees a roughly uniform view of the data despite churn.
- Training data – A curated 1.1 T‑token corpus (web text, books, code) is sharded across peers; each node samples locally and contributes gradients to the global update.
- Evaluation – After pre‑training, Covenant‑72B is evaluated on standard benchmarks (e.g., MMLU, GSM‑8K) to assess its zero‑shot and few‑shot capabilities.
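The paper's exact SparseLoCo algorithm is not reproduced in this summary; the sketch below only illustrates the general pattern it builds on: peers take several local optimizer steps, then exchange a top‑k‑sparsified pseudo‑gradient, with an error‑feedback residual so dropped coordinates are not lost. All names here (`Peer`, `topk_sparsify`, `local_sgd_round`) are hypothetical, and the toy quadratic loss stands in for real training data.

```python
class Peer:
    """Toy peer whose local 'data' defines a quadratic loss ||params - target||^2."""
    def __init__(self, target):
        self.target = target
        self.residual = [0.0] * len(target)  # error-feedback buffer

    def grad(self, params):
        return [2.0 * (p - t) for p, t in zip(params, self.target)]


def topk_sparsify(vec, k):
    """Keep the k largest-magnitude entries, zero the rest.
    Returns (sparse, residual); the residual is carried into the next round
    so dropped coordinates are eventually transmitted (error feedback)."""
    keep = set(sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k])
    sparse = [v if i in keep else 0.0 for i, v in enumerate(vec)]
    residual = [v - s for v, s in zip(vec, sparse)]
    return sparse, residual


def local_sgd_round(global_params, peers, local_steps, lr, k):
    """One outer round: each peer runs local SGD on its own data, then
    communicates only a top-k-sparsified pseudo-gradient (global minus local)."""
    deltas = []
    for peer in peers:
        params = list(global_params)
        for _ in range(local_steps):
            g = peer.grad(params)
            params = [p - lr * gi for p, gi in zip(params, g)]
        pseudo = [g0 - p for g0, p in zip(global_params, params)]
        corrected = [pg + r for pg, r in zip(pseudo, peer.residual)]
        sparse, peer.residual = topk_sparsify(corrected, k)
        deltas.append(sparse)
    # Average the sparse deltas and apply them as the outer update.
    avg = [sum(col) / len(deltas) for col in zip(*deltas)]
    return [g0 - a for g0, a in zip(global_params, avg)]
```

With `k` equal to the full parameter dimension this reduces to plain local‑SGD averaging; shrinking `k` trades bandwidth for slower per‑round progress, which the error‑feedback residual compensates for over subsequent rounds.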
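The “stake‑and‑verify” idea from the coordination layer can be sketched as follows. This is a hypothetical illustration, not the paper's actual on‑chain protocol: peers deposit stake, a validator recomputes a randomly audited update, and a mismatch triggers slashing. `StakeRegistry`, `hash_update`, and the 50% slash fraction are all assumptions made for the example.

```python
import hashlib

def hash_update(update):
    """Deterministic fingerprint of a gradient update (list of floats)."""
    h = hashlib.sha256()
    for v in update:
        h.update(f"{v:.8e}".encode())
    return h.hexdigest()

class StakeRegistry:
    """Toy stake-and-verify ledger: peers deposit stake; a validator
    recomputes a randomly audited update and slashes stake on mismatch."""
    def __init__(self, slash_fraction=0.5):
        self.stake = {}
        self.slash_fraction = slash_fraction

    def register(self, peer_id, deposit):
        self.stake[peer_id] = deposit

    def audit(self, peer_id, submitted_hash, recomputed_update):
        # The validator re-runs the peer's assigned work and compares hashes.
        ok = submitted_hash == hash_update(recomputed_update)
        if not ok:
            self.stake[peer_id] *= (1 - self.slash_fraction)
        return ok
```

Because honest recomputation is only sampled, this kind of scheme deters rather than prevents poisoning, which is consistent with the limitation the authors note about hardened verification.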
Results & Findings
- Performance parity: On the MMLU benchmark, Covenant‑72B scores 58.3% accuracy, within 1–2% of a centrally‑trained 70 B model that used 1.3× more GPU‑hours.
- Training efficiency: Despite network latency and node churn, SparseLoCo achieved a 3.4× reduction in communication overhead compared to naïve all‑reduce, cutting total wall‑clock time by ~22%.
- Robustness to churn: Simulated node dropout rates up to 30% had negligible impact on final perplexity, confirming the optimizer’s resilience.
- Cost savings: The distributed run consumed ~12,000 GPU‑hours, roughly 30% less than a comparable centralized run, thanks to the ability to tap idle resources worldwide.
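A back‑of‑envelope check of the figures above, using only the numbers reported in this summary; the derived values (implied centralized cost, per‑round bandwidth fraction) are inferred, not stated in the paper.

```python
# Figures reported in the summary; derived quantities are back-of-envelope.
distributed_gpu_hours = 12_000     # reported cost of the distributed run
savings_vs_central = 0.30          # "roughly 30% less" than centralized
implied_central_hours = distributed_gpu_hours / (1 - savings_vs_central)

comm_reduction = 3.4               # vs naive all-reduce
bandwidth_fraction = 1 / comm_reduction

print(f"implied centralized cost: {implied_central_hours:,.0f} GPU-hours")
print(f"bandwidth per round vs all-reduce: {bandwidth_fraction:.0%}")
```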
Practical Implications
- Democratized AI development – Start‑ups, research labs, or even hobbyist groups can now contribute compute and receive a stake in a cutting‑edge model without negotiating contracts with cloud providers.
- Cost‑effective scaling – Companies can offload portions of large‑model training to a volunteer network, reducing cloud spend while maintaining competitive performance.
- Resilient training pipelines – SparseLoCo’s tolerance for node churn makes it attractive for edge‑centric AI workloads where connectivity is intermittent (e.g., federated learning across mobile devices).
- Open‑source model ecosystem – By releasing the weights, the community gains a high‑capacity LLM that can be fine‑tuned for niche applications (code assistance, domain‑specific chatbots) without the massive upfront compute investment.
Limitations & Future Work
- Security model – While the blockchain provides basic trustlessness, sophisticated attacks (e.g., gradient poisoning) remain a concern and need hardened verification mechanisms.
- Data heterogeneity – The current setup assumes roughly uniform data quality across peers; future work should explore adaptive weighting for highly skewed datasets.
- Scalability ceiling – Experiments beyond 72 B parameters are pending; it is unclear how communication patterns will behave at the trillion‑parameter scale.
- Energy accounting – The paper does not quantify the carbon footprint of the distributed run versus centralized training—a metric increasingly important for responsible AI.
Bottom line: Covenant‑72B proves that “anyone can help train a giant LLM” is no longer a sci‑fi fantasy. With a trustless blockchain backbone and a clever optimizer, the research community now has a viable blueprint for building massive models in a truly open, cost‑effective manner.
Authors
- Joel Lidin
- Amir Sarfi
- Erfan Miahi
- Quentin Anthony
- Shivam Chauhan
- Evangelos Pappas
- Benjamin Thérien
- Eugene Belilovsky
- Samuel Dare
Paper Information
- arXiv ID: 2603.08163v1
- Categories: cs.DC, cs.LG
- Published: March 9, 2026