[Paper] gridfm-datakit-v1: A Python Library for Scalable and Realistic Power Flow and Optimal Power Flow Data Generation

Published: December 16, 2025 at 01:17 PM EST

Source: arXiv - 2512.14658v1

Overview

The paper presents gridfm‑datakit‑v1, an open‑source Python library that automatically generates large, realistic Power Flow (PF) and Optimal Power Flow (OPF) datasets. It addresses three long‑standing gaps in existing data generators, so that machine‑learning researchers and power‑system engineers can train and benchmark ML‑based solvers on data that reflects the variability and constraints of real‑world grids.

Key Contributions

  • Unified stochastic load modeling – mixes real‑world load‑profile scaling with localized random noise, producing diverse yet physically plausible demand patterns.
  • Arbitrary N‑k topology perturbations – supports random line outages or reconfigurations, allowing users to explore contingency scenarios without hand‑crafting cases.
  • Beyond‑limit PF samples – deliberately generates power‑flow states that violate voltage or thermal limits, helping ML models learn to detect and correct infeasible operating points.
  • Variable generator cost functions – creates OPF instances with randomly sampled cost curves, improving model generalization across different market conditions.
  • Scalable to very large networks – demonstrated on test systems up to 10 k buses with modest compute resources.
  • Easy integration – distributed via PyPI (pip install gridfm-datakit) and released under the permissive Apache 2.0 license; API mirrors familiar Pandas/NumPy patterns.

Methodology

  1. Load & Profile Generation

    • Starts from a base load vector (e.g., a 24‑h profile from a utility).
    • Applies a global scaling factor drawn from a distribution that reflects daily/seasonal demand swings.
    • Adds local perturbations (Gaussian or uniform noise) per bus to capture stochastic consumption.
  2. Topology Randomization

    • Users specify an N‑k budget (e.g., “remove up to 3 lines”).
    • The library randomly selects lines to open and either ensures that the resulting network stays connected or intentionally creates islands for contingency studies.
  3. Power‑Flow Solving

    • For each load‑topology pair, a standard Newton‑Raphson PF solver (via pandapower/PYPOWER) computes voltages, flows, and losses.
    • If the solution violates limits, the sample is still kept (this is a key departure from most datasets that discard such cases).
  4. OPF Instance Creation

    • Generator cost coefficients (quadratic, linear, constant) are sampled from user‑defined ranges.
    • The OPF problem is solved with an interior‑point algorithm; both the optimal dispatch and the associated dual variables are stored.
  5. Data Packaging

    • Results are exported as lightweight HDF5/Parquet files, together with metadata (seed, scaling factor, topology changes).
    • A small helper class (DataKitLoader) streams batches directly into PyTorch or TensorFlow pipelines.
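
Steps 1 and 2 above can be illustrated with a minimal standard‑library sketch. It is not the gridfm‑datakit API: the function names (perturb_loads, is_connected, sample_outage), the scaling range, the noise level, and the toy 4‑bus network are all assumptions chosen for illustration. It shows one global scaling factor combined with per‑bus Gaussian noise, followed by an N‑k draw that rejects any outage set that would split the network.

```python
import random
from collections import deque

# Hypothetical helpers for illustration only; not the gridfm-datakit API.


def perturb_loads(base_loads, rng, global_range=(0.7, 1.1), local_sigma=0.05):
    """Scale all loads by one global factor, then add per-bus Gaussian noise."""
    global_factor = rng.uniform(*global_range)
    # Clamp at zero so the noise can never produce negative demand.
    return [
        max(0.0, p * global_factor * (1.0 + rng.gauss(0.0, local_sigma)))
        for p in base_loads
    ]


def is_connected(n_buses, lines):
    """Breadth-first search from bus 0 over the in-service lines."""
    adjacency = {bus: [] for bus in range(n_buses)}
    for a, b in lines:
        adjacency[a].append(b)
        adjacency[b].append(a)
    seen, queue = {0}, deque([0])
    while queue:
        bus = queue.popleft()
        for neighbour in adjacency[bus]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == n_buses


def sample_outage(n_buses, lines, k_max, rng, max_tries=100):
    """Open up to k_max random lines, rejecting draws that island the grid."""
    for _ in range(max_tries):
        k = rng.randint(0, min(k_max, len(lines)))
        removed = set(rng.sample(range(len(lines)), k))
        remaining = [ln for i, ln in enumerate(lines) if i not in removed]
        if is_connected(n_buses, remaining):
            return remaining
    return list(lines)  # fall back to the intact topology


# Toy 4-bus ring with one chord (made-up data).
rng = random.Random(42)
lines = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
loads = perturb_loads([10.0, 20.0, 15.0, 5.0], rng)
topology = sample_outage(4, lines, k_max=2, rng=rng)
```

Rejection sampling keeps the sketch short; on large networks with a small k budget most draws stay connected, so few retries are expected. Dropping the connectivity test would give the intentional‑islanding variant.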

The entire pipeline is parallelized with Python’s concurrent.futures, so that tens of thousands of samples for a 10 k‑bus system can be generated in under an hour on a 16‑core workstation.
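
The fan‑out pattern that concurrent.futures enables can be sketched as follows. This is an assumption about the general shape of such a pipeline, not the library's actual code: generate_sample and generate_dataset are hypothetical names, and the per‑sample work is a stand‑in for a real load/topology/solve step. The sketch uses ThreadPoolExecutor so that it runs unchanged in any environment; for CPU‑bound solver work, ProcessPoolExecutor exposes the same interface and is the usual choice.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def generate_sample(seed):
    """Stand-in for one load/topology/solve step (hypothetical workload)."""
    rng = random.Random(seed)  # one seed per sample keeps runs reproducible
    scaling = rng.uniform(0.7, 1.1)
    return {"seed": seed, "scaling": scaling}


def generate_dataset(seeds, max_workers=4, executor_cls=ThreadPoolExecutor):
    """Fan the seeds out over a worker pool; map() preserves input order."""
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(generate_sample, seeds))


samples = generate_dataset(range(8))
```

Giving each sample its own seed, rather than sharing one random generator across workers, means the output does not depend on how work is scheduled, which matches the seed metadata stored with each sample.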

Results & Findings

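Test System        | # Samples | Avg. Generation Time (s) | % PF Samples Violating Limits
-------------------|-----------|--------------------------|------------------------------
IEEE‑14            | 50 k      | 0.12                     | 8 %
IEEE‑118           | 200 k     | 0.45                     | 12 %
Synthetic 10 k‑bus | 30 k      | 3.8                      | 15 %
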
  • Diversity boost: Compared to OPFData and PFΔ, gridfm‑datakit’s datasets show a 2–3× larger spread in load levels and a 5–10× higher incidence of limit‑violating states.
  • Training impact: A simple feed‑forward NN trained on the new PF data achieved 94 % accuracy in predicting voltage violations, versus 78 % when trained on conventional (feasible‑only) datasets.
  • Scalability: Memory footprint grows linearly with bus count; the library stays under 8 GB RAM for the 10 k‑bus case, making it suitable for cloud‑based batch jobs.
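
The share of limit‑violating samples reported above presupposes a per‑sample check of the solved state. A minimal sketch of such a check follows, under stated assumptions: the names violates_limits and violation_share, the 0.95 to 1.05 p.u. voltage band, the 100 % loading threshold, and the toy samples are all illustrative and not taken from the paper or the library.

```python
def violates_limits(vm_pu, loading_percent,
                    v_min=0.95, v_max=1.05, max_loading=100.0):
    """Return True if any bus voltage or line loading is out of bounds.

    The thresholds are assumed defaults, not values from the paper.
    """
    voltage_bad = any(v < v_min or v > v_max for v in vm_pu)
    thermal_bad = any(load > max_loading for load in loading_percent)
    return voltage_bad or thermal_bad


def violation_share(samples):
    """Fraction of samples with at least one voltage or thermal violation."""
    flagged = sum(
        violates_limits(s["vm_pu"], s["loading_percent"]) for s in samples
    )
    return flagged / len(samples)


# Four made-up PF results: one undervoltage, one overload, two within limits.
samples = [
    {"vm_pu": [1.00, 0.93], "loading_percent": [50.0, 60.0]},
    {"vm_pu": [1.00, 1.02], "loading_percent": [120.0, 40.0]},
    {"vm_pu": [1.01, 0.99], "loading_percent": [80.0, 75.0]},
    {"vm_pu": [1.00, 1.00], "loading_percent": [10.0, 99.0]},
]
share = violation_share(samples)
```

Keeping the flag alongside each retained sample, instead of filtering on it, is what lets a downstream classifier see both feasible and infeasible operating points.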

These numbers suggest that the richer data produced by the library also translates into measurable gains for downstream ML models.

Practical Implications

  • Grid operators who use ML can now train solvers that are robust to unexpected overloads, enabling faster “what‑if” analyses during emergencies.
  • Renewable integration studies benefit from realistic stochastic load and topology variations, improving the fidelity of scenario‑based planning tools.
  • Market simulation platforms can inject dynamic generator cost curves, allowing analysts to test pricing algorithms under a wider set of economic conditions.
  • Software vendors can embed gridfm‑datakit into their test suites to automatically generate regression datasets for new PF/OPF solvers, reducing manual data‑curation effort.
  • Educational tools gain a plug‑and‑play source of diverse cases, helping students explore contingency analysis without building custom data pipelines.

Limitations & Future Work

  • The current implementation relies on deterministic PF solvers; stochastic or probabilistic power‑flow methods are not yet supported.
  • While topology perturbations preserve connectivity by default, more sophisticated contingency models (e.g., N‑k‑m with islanding) are left to the user.
  • The library focuses on balanced, single‑phase networks; extending to unbalanced three‑phase distribution models is planned.
  • Future releases aim to integrate GPU‑accelerated PF solvers and to provide a benchmark suite that automatically evaluates ML model performance across generated datasets.

Authors

  • Alban Puech
  • Matteo Mazzonelli
  • Celia Cintas
  • Tamara R. Govindasamy
  • Mangaliso Mngomezulu
  • Jonas Weiss
  • Matteo Baù
  • Anna Varbella
  • François Mirallès
  • Kibaek Kim
  • Le Xie
  • Hendrik F. Hamann
  • Etienne Vos
  • Thomas Brunschwiler

Paper Information

  • arXiv ID: 2512.14658v1
  • Categories: cs.LG, cs.AI, eess.SY, math.OC
  • Published: December 16, 2025