[Paper] Distributed LLM Pretraining During Renewable Curtailment Windows: A Feasibility Study
Source: arXiv - 2602.22760v1
Overview
The paper explores a way to cut both the carbon footprint and the cost of pretraining large language models (LLMs) by aligning compute jobs with renewable curtailment windows, i.e., periods when surplus clean generation would otherwise be wasted. By running GPU‑intensive training only when excess clean energy is available, the authors show that a 561 M‑parameter transformer can be trained across multiple data‑center sites while cutting operational emissions to 5‑12 % of a conventional single‑site baseline.
Key Contributions
- Curtailment‑aware scheduling framework that dynamically switches between single‑site and federated multi‑site training based on real‑time renewable excess.
- Prototype implementation using the Flower federated‑learning library to coordinate three geographically distributed GPU clusters.
- Empirical evaluation showing that training quality (perplexity, loss convergence) is preserved while operational emissions drop to 5‑12 % of a traditional single‑site run.
- Open‑source data pipeline that ingests public marginal carbon‑intensity traces to predict curtailment windows for multiple regions.
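The cross‑site coordination in the prototype is handled by Flower's secure aggregation; the underlying operation is a weighted average of per‑site weight updates (FedAvg‑style). A minimal sketch of that merge step, using plain Python lists instead of real model tensors, is shown below; the two‑site example data is an illustrative assumption, not a figure from the paper.

```python
# FedAvg-style merge of per-site weight vectors, weighted by the number
# of local training samples each site contributed. In the paper's
# prototype this exchange is performed by Flower's secure aggregation;
# here we average plain Python lists purely for illustration.

def fedavg(site_weights, site_samples):
    """Weighted average of weight vectors from the active sites.

    site_weights: list of weight vectors (one per active site)
    site_samples: local training samples seen by each site
    """
    total = sum(site_samples)
    merged = [0.0] * len(site_weights[0])
    for weights, n in zip(site_weights, site_samples):
        for i, w in enumerate(weights):
            merged[i] += w * n  # accumulate sample-weighted sums
    return [m / total for m in merged]

# Example: two sites, the second saw twice as much data.
print(fedavg([[1.0, 2.0], [4.0, 8.0]], [100, 200]))  # -> [3.0, 6.0]
```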
Methodology
- Data‑driven curtailment detection – The authors pull hourly marginal carbon intensity data (e.g., from ENTSO‑E, CAISO) and flag periods where the intensity falls below a renewable‑dominant threshold, indicating surplus clean power.
- Elastic training orchestration – A central scheduler monitors the curtailment signals for each site. When a site enters a curtailment window, it is added to the training pool; when the window closes, the site is gracefully removed.
- Federated synchronization – While multiple sites are active, each runs a local copy of the model on its GPU cluster. After a configurable number of local steps, the sites exchange weight updates via Flower’s secure aggregation, effectively performing a distributed SGD step.
- Fallback to single‑site mode – If only one site has excess power, the system continues training locally, avoiding idle time.
- Evaluation metrics – Model convergence (loss, perplexity) is compared against a baseline that trains continuously on a single data‑center. Energy consumption and carbon emissions are estimated using the same marginal intensity data.
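The detection and orchestration steps above can be sketched in a few lines. The threshold value and site names below are illustrative assumptions, not figures from the paper; a real deployment would feed in hourly marginal carbon‑intensity traces from sources like ENTSO‑E or CAISO.

```python
# Hedged sketch of curtailment-aware site pooling: flag sites whose
# marginal carbon intensity is below a renewable-dominant threshold,
# then pick a training mode from the size of the resulting pool.
# The threshold and readings are assumed values for illustration.

RENEWABLE_THRESHOLD = 50.0  # gCO2-eq/kWh; below this we treat power as surplus clean (assumed)

def active_sites(carbon_intensity):
    """Return the sites currently inside a curtailment window.

    carbon_intensity: dict of site name -> marginal intensity (gCO2-eq/kWh)
    """
    return sorted(site for site, ci in carbon_intensity.items()
                  if ci < RENEWABLE_THRESHOLD)

def training_mode(pool):
    """Federated multi-site, single-site fallback, or paused."""
    if len(pool) >= 2:
        return "federated"
    if len(pool) == 1:
        return "single-site"
    return "paused"

readings = {"site-a": 30.0, "site-b": 120.0, "site-c": 45.0}
pool = active_sites(readings)
print(pool, training_mode(pool))  # -> ['site-a', 'site-c'] federated
```

When the pool shrinks to one site, the sketch degrades to the single‑site fallback the authors describe, so GPUs at the remaining site keep training rather than idling.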
Results & Findings
| Metric | Baseline (single‑site) | Curtailment‑aware (3‑site) |
|---|---|---|
| Final validation loss | 1.84 | 1.86 |
| Perplexity (test) | 12.3 | 12.5 |
| Total GPU‑hours | 4,800 | 4,950 (≈ 3 % overhead) |
| CO₂‑equivalent emissions | 1.0 × (baseline) | 0.05‑0.12 × (5‑12 %) |
| Average training wall‑time | 7 days | 7.3 days |
Key takeaways:
- Model quality remains essentially unchanged despite the intermittent, distributed nature of the training.
- Energy savings are dramatic because the system only consumes power when it is already being generated cleanly and at low marginal cost.
- The communication overhead of federated synchronization is modest (≈ 3 % extra GPU‑hours).
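The emissions comparison follows from a simple identity: emissions equal energy drawn times the marginal carbon intensity during the hours the job actually ran. The sketch below reproduces a ratio inside the reported 5‑12 % range using assumed per‑GPU power draw and intensity values; these are illustrative numbers, not measurements from the paper.

```python
# Back-of-the-envelope emissions estimate: energy (kWh) times marginal
# carbon intensity (gCO2-eq/kWh). All constants here are illustrative
# assumptions, not measurements from the paper.

GPU_POWER_KW = 0.3  # assumed average draw per GPU, kW

def emissions_kg(gpu_hours, intensity_g_per_kwh):
    """CO2-equivalent emissions in kg for a training run."""
    energy_kwh = gpu_hours * GPU_POWER_KW
    return energy_kwh * intensity_g_per_kwh / 1000.0

baseline = emissions_kg(4800, 400.0)   # grid-average intensity (assumed)
curtailed = emissions_kg(4950, 30.0)   # surplus-renewable intensity (assumed)
print(round(curtailed / baseline, 3))  # -> 0.077, inside the reported 5-12 % range
```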
Practical Implications
- Cost reduction for AI teams – Many cloud providers already offer lower prices when excess renewable generation is available; aligning training jobs with those windows can cut electricity bills substantially.
- Sustainability certifications – Companies can claim “curtailment‑powered training” as a concrete, measurable ESG initiative, which is increasingly important for investors and customers.
- Edge‑to‑cloud training pipelines – The elastic federated approach can be repurposed for scenarios where compute resources are sporadic (e.g., volunteer GPU networks, edge devices with solar panels).
- Policy alignment – Grid operators seeking to reduce curtailment penalties could incentivize AI workloads, creating a win‑win market for clean‑energy utilization.
Limitations & Future Work
- Dependence on accurate curtailment forecasts – Mis‑predicted windows can lead to idle GPUs or missed training steps; integrating more sophisticated weather and market models is a next step.
- Scalability to multi‑billion‑parameter models – The study stops at 561 M parameters; larger models will stress network bandwidth and may need hierarchical aggregation strategies.
- Geographic and regulatory constraints – Not all regions expose granular marginal carbon data, limiting the approach’s global applicability.
- Security & privacy – While Flower provides secure aggregation, real‑world deployments will need hardened protocols to protect model IP during cross‑site weight exchanges.
Future research will explore adaptive learning‑rate schedules that react to the irregular training cadence, tighter integration with renewable‑energy market APIs, and extending the framework to support mixed‑precision training for even larger models.
Authors
- Philipp Wiesner
- Soeren Becker
- Brett Cornick
- Dominik Scheinert
- Alexander Acker
- Odej Kao
Paper Information
- arXiv ID: 2602.22760v1
- Categories: cs.DC, cs.AI
- Published: February 26, 2026