[Paper] Distributed LLM Pretraining During Renewable Curtailment Windows: A Feasibility Study

Published: February 26, 2026 at 03:49 AM EST
Source: arXiv - 2602.22760v1

Overview

The paper explores a way to cut both the carbon footprint and the cost of pre‑training large language models (LLMs) by syncing compute jobs with renewable curtailment windows, i.e., periods when surplus clean energy would otherwise be wasted. By running GPU‑intensive training only during these windows, the authors show that a 561 M‑parameter transformer can be trained across multiple data‑center sites while cutting operational emissions to 5‑12 % of a single‑site baseline.

Key Contributions

  • Curtailment‑aware scheduling framework that dynamically switches between single‑site and federated multi‑site training based on real‑time renewable excess.
  • Prototype implementation using the Flower federated‑learning library to coordinate three geographically distributed GPU clusters.
  • Empirical evaluation showing that training quality (perplexity, loss convergence) is preserved while operational emissions drop to 5‑12 % of a traditional single‑site run.
  • Open‑source data pipeline that ingests public marginal carbon‑intensity traces to predict curtailment windows for multiple regions.
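The thresholding step behind curtailment detection can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the function name and the 100 gCO₂eq/kWh cutoff are assumptions for the example, and a real deployment would use forecasted rather than observed intensity.

```python
from typing import List, Tuple

def flag_curtailment_windows(
    hourly_intensity: List[float],
    threshold_g_per_kwh: float = 100.0,
) -> List[Tuple[int, int]]:
    """Return (start_hour, end_hour_exclusive) spans where marginal
    carbon intensity stays below a renewable-dominant threshold."""
    windows = []
    start = None
    for hour, intensity in enumerate(hourly_intensity):
        if intensity < threshold_g_per_kwh:
            if start is None:
                start = hour
        elif start is not None:
            windows.append((start, hour))
            start = None
    if start is not None:  # window still open at end of trace
        windows.append((start, len(hourly_intensity)))
    return windows

# Example: a trace with one low-carbon window in hours 2-4
trace = [250, 220, 90, 60, 70, 180, 300]
print(flag_curtailment_windows(trace))  # [(2, 5)]
```

Feeding this the hourly traces from sources such as ENTSO‑E or CAISO yields, per site, the windows the scheduler can plan around.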

Methodology

  1. Data‑driven curtailment detection – The authors pull hourly marginal carbon intensity data (e.g., from ENTSO‑E, CAISO) and flag periods where the intensity falls below a renewable‑dominant threshold, indicating surplus clean power.
  2. Elastic training orchestration – A central scheduler monitors the curtailment signals for each site. When a site enters a curtailment window, it is added to the training pool; when the window closes, the site is gracefully removed.
  3. Federated synchronization – While multiple sites are active, each runs a local copy of the model on its GPU cluster. After a configurable number of local steps, the sites exchange weight updates via Flower’s secure aggregation, effectively performing a distributed SGD step.
  4. Fallback to single‑site mode – If only one site has excess power, the system continues training locally, avoiding idle time.
  5. Evaluation metrics – Model convergence (loss, perplexity) is compared against a baseline that trains continuously on a single data‑center. Energy consumption and carbon emissions are estimated using the same marginal intensity data.

Results & Findings

| Metric | Baseline (single‑site) | Curtailment‑aware (3‑site) |
| --- | --- | --- |
| Final validation loss | 1.84 | 1.86 |
| Perplexity (test) | 12.3 | 12.5 |
| Total GPU‑hours | 4,800 | 4,950 (≈ 3 % overhead) |
| CO₂‑equivalent emissions | 1.0 × (baseline) | 0.05‑0.12 × (5‑12 %) |
| Average training wall‑time | 7 days | 7.3 days |

Key takeaways:

  • Model quality remains essentially unchanged despite the intermittent, distributed nature of the training.
  • Energy savings are dramatic because the system only consumes power when it is already being generated cleanly and at low marginal cost.
  • The communication overhead of federated synchronization is modest: about 150 extra GPU‑hours, or ≈ 3 % over the 4,800‑hour baseline.
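A quick back‑of‑the‑envelope check of the headline numbers. The 0.05 intensity ratio is an assumed value matching the lower bound of the reported 5‑12 % range, not a figure computed in the paper:

```python
baseline_gpu_hours = 4_800
elastic_gpu_hours = 4_950

# Extra GPU-hours paid for joining/leaving sites and weight exchange
overhead = (elastic_gpu_hours - baseline_gpu_hours) / baseline_gpu_hours
print(f"Sync overhead: {overhead:.1%}")  # Sync overhead: 3.1%

# Operational emissions scale roughly as GPU-hours times the average
# marginal intensity of the power consumed; training only in clean
# windows shrinks the intensity factor far more than the overhead
# grows the hours.
curtailment_intensity_ratio = 0.05  # assumed: clean-window vs. grid average
emissions_multiple = (
    elastic_gpu_hours / baseline_gpu_hours
) * curtailment_intensity_ratio
print(f"Emissions vs baseline: {emissions_multiple:.3f}x")  # 0.052x
```

The point of the arithmetic: a ~3 % compute penalty is a rounding error next to a ~20x reduction in the carbon intensity of the electricity consumed.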

Practical Implications

  • Cost reduction for AI teams – Many cloud providers already price excess renewable energy lower; aligning training jobs with those windows can slash electricity bills.
  • Sustainability certifications – Companies can claim “curtailment‑powered training” as a concrete, measurable ESG initiative, which is increasingly important for investors and customers.
  • Edge‑to‑cloud training pipelines – The elastic federated approach can be repurposed for scenarios where compute resources are sporadic (e.g., volunteer GPU networks, edge devices with solar panels).
  • Policy alignment – Grid operators seeking to reduce curtailment penalties could incentivize AI workloads, creating a win‑win market for clean‑energy utilization.

Limitations & Future Work

  • Dependence on accurate curtailment forecasts – Mis‑predicted windows can lead to idle GPUs or missed training steps; integrating more sophisticated weather and market models is a next step.
  • Scalability to multi‑billion‑parameter models – The study stops at 561 M parameters; larger models will stress network bandwidth and may need hierarchical aggregation strategies.
  • Geographic and regulatory constraints – Not all regions expose granular marginal carbon data, limiting the approach’s global applicability.
  • Security & privacy – While Flower provides secure aggregation, real‑world deployments will need hardened protocols to protect model IP during cross‑site weight exchanges.

Future research will explore adaptive learning‑rate schedules that react to the irregular training cadence, tighter integration with renewable‑energy market APIs, and extending the framework to support mixed‑precision training for even larger models.

Authors

  • Philipp Wiesner
  • Soeren Becker
  • Brett Cornick
  • Dominik Scheinert
  • Alexander Acker
  • Odej Kao

Paper Information

  • arXiv ID: 2602.22760v1
  • Categories: cs.DC, cs.AI
  • Published: February 26, 2026
  • PDF: Download PDF