[Paper] gridfm-datakit-v1: A Python Library for Scalable and Realistic Power Flow and Optimal Power Flow Data Generation

Published: December 16, 2025 at 01:17 PM EST

Source: arXiv - 2512.14658v1

Overview

The paper presents gridfm‑datakit‑v1, an open‑source Python library that automatically generates large, realistic Power Flow (PF) and Optimal Power Flow (OPF) datasets. It targets three gaps in existing data generators: limited diversity in loads and topologies, PF samples restricted to operating points within limits, and OPF instances with fixed generator costs. Closing these gaps lets machine‑learning researchers and power‑system engineers train and benchmark ML‑based solvers on data that reflects the variability and constraints of real‑world grids.

Key Contributions

Unified stochastic load modeling – mixes real‑world load‑profile scaling with localized random noise, producing diverse yet physically plausible demand patterns.
Arbitrary N‑k topology perturbations – supports random line outages or reconfigurations, allowing users to explore contingency scenarios without hand‑crafting cases.
Beyond‑limit PF samples – deliberately generates power‑flow states that violate voltage or thermal limits, helping ML models learn to detect and correct infeasible operating points.
Variable generator cost functions – creates OPF instances with randomly sampled cost curves, improving model generalization across different market conditions.
Scalable to very large networks – demonstrated on test systems up to 10 k buses with modest compute resources.
Easy integration – distributed via PyPI (pip install gridfm-datakit) and released under the permissive Apache 2.0 license; API mirrors familiar Pandas/NumPy patterns.

Methodology

Load & Profile Generation
- Starts from a base load vector (e.g., a 24‑h profile from a utility).
- Applies a global scaling factor drawn from a distribution that reflects daily/seasonal demand swings.
- Adds local perturbations (Gaussian or uniform noise) per bus to capture stochastic consumption.
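
The load step above can be sketched with NumPy as follows. This is a minimal illustration, not the library's implementation: the function name `perturb_loads`, the uniform distribution for the global factor, the multiplicative Gaussian noise, and all default values are assumptions chosen for the example.

```python
import numpy as np

def perturb_loads(base_load_mw, n_samples, scale_range=(0.7, 1.1),
                  local_sigma=0.05, seed=0):
    """Return an (n_samples, n_buses) array of perturbed bus loads.

    All names and default values here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    base = np.asarray(base_load_mw, dtype=float)
    # One global scaling factor per sample (stands in for daily/seasonal swings).
    global_scale = rng.uniform(scale_range[0], scale_range[1], size=(n_samples, 1))
    # Independent multiplicative Gaussian noise per bus and per sample.
    local_noise = rng.normal(loc=1.0, scale=local_sigma, size=(n_samples, base.size))
    loads = base * global_scale * local_noise
    # Clip so that noise can never turn a load into negative demand.
    return np.clip(loads, 0.0, None)

base = [20.0, 45.0, 0.0, 12.5]  # hypothetical base loads in MW
samples = perturb_loads(base, n_samples=5)
print(samples.shape)
```

Keeping the global factor and the local noise as separate multiplicative terms means each can be logged as metadata and tuned independently.
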
Topology Randomization
- Users specify an N‑k budget (e.g., “remove up to 3 lines”).
- The library randomly selects lines to open and, by default, checks that the resulting network stays connected; it can also intentionally create islands for contingency studies.
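
One way to picture the connectivity‑preserving variant of this step is rejection sampling: draw a random set of lines, then keep the draw only if the remaining graph is still connected. The sketch below uses only the standard library; `is_connected`, `sample_outage`, the retry limit, and the toy network are hypothetical and not taken from the library.

```python
import random

def is_connected(n_buses, lines):
    """Check connectivity with a depth-first search over the remaining lines."""
    adjacency = {bus: [] for bus in range(n_buses)}
    for a, b in lines:
        adjacency[a].append(b)
        adjacency[b].append(a)
    seen, stack = {0}, [0]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == n_buses

def sample_outage(n_buses, lines, k_max, rng, max_tries=100):
    """Open up to k_max random lines, rejecting draws that island the grid."""
    for _ in range(max_tries):
        k = rng.randint(0, min(k_max, len(lines)))
        opened = set(rng.sample(range(len(lines)), k))
        remaining = [line for i, line in enumerate(lines) if i not in opened]
        if is_connected(n_buses, remaining):
            return sorted(opened), remaining
    return [], list(lines)  # fall back to the intact topology

# Toy 5-bus ring with one chord (hypothetical data).
lines = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
opened, remaining = sample_outage(5, lines, k_max=3, rng=random.Random(42))
print(opened, is_connected(5, remaining))
```

Rejection sampling is simple but can waste draws on sparse networks where most outages create islands; the retry limit and fallback keep the sketch from looping indefinitely.
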
Power‑Flow Solving
- For each load‑topology pair, a standard Newton‑Raphson PF solver (via pandapower/PYPOWER) computes voltages, flows, and losses.
- If the solution violates limits, the sample is still kept (this is a key departure from most datasets that discard such cases).
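
Keeping limit‑violating samples is most useful when each sample also carries a label saying which limits were exceeded. The sketch below assumes the solver output is available as arrays of per‑unit bus voltages and percentage line loadings (the argument names echo pandapower's `vm_pu` and `loading_percent` result columns); the function name `label_violations` and the threshold defaults are illustrative assumptions.

```python
import numpy as np

def label_violations(vm_pu, loading_percent,
                     vm_min=0.95, vm_max=1.05, max_loading=100.0):
    """Flag voltage and thermal limit violations instead of discarding the sample."""
    vm = np.asarray(vm_pu, dtype=float)
    loading = np.asarray(loading_percent, dtype=float)
    out_of_band = (vm < vm_min) | (vm > vm_max)
    return {
        "voltage_violation": bool(out_of_band.any()),
        "thermal_violation": bool((loading > max_loading).any()),
        "n_buses_out_of_band": int(out_of_band.sum()),
    }

# Hypothetical solver output: one low-voltage bus and one overloaded line.
labels = label_violations([1.02, 0.93, 1.00], [64.0, 112.5])
print(labels)
```
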
OPF Instance Creation
- Generator cost coefficients (quadratic, linear, constant) are sampled from user‑defined ranges.
- The OPF problem is solved with an interior‑point algorithm; both the optimal dispatch and the associated dual variables are stored.
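
To see why varying the cost coefficients changes the optimal dispatch, the sketch below samples quadratic cost curves and solves a deliberately simplified problem: an economic dispatch with no network model and no generator limits, which has a closed‑form solution. This is not the interior‑point OPF described above; the function names, coefficient ranges, and demand value are assumptions for illustration. The multiplier `lam` plays the role of the dual variable on the power‑balance constraint.

```python
import numpy as np

def sample_costs(n_gens, rng, a_range=(0.01, 0.1),
                 b_range=(10.0, 40.0), c_range=(0.0, 100.0)):
    """Sample quadratic (a), linear (b), constant (c) cost coefficients."""
    a = rng.uniform(*a_range, size=n_gens)
    b = rng.uniform(*b_range, size=n_gens)
    c = rng.uniform(*c_range, size=n_gens)
    return a, b, c

def dispatch_without_limits(a, b, demand_mw):
    """Minimize sum(a*p^2 + b*p + c) subject to sum(p) == demand_mw.

    Stationarity gives 2*a*p + b = lam for every generator, so
    p = (lam - b) / (2*a); lam follows from the balance constraint.
    Generator limits are ignored, so p may fall outside a real unit's range.
    """
    inv = 1.0 / (2.0 * a)
    lam = (demand_mw + np.sum(b * inv)) / np.sum(inv)
    p = (lam - b) * inv
    return p, lam

rng = np.random.default_rng(7)
a, b, c = sample_costs(3, rng)
p, lam = dispatch_without_limits(a, b, demand_mw=300.0)
total_cost = float(np.sum(a * p**2 + b * p + c))
print(p.round(1), round(float(lam), 2), round(total_cost, 1))
```

Re‑running with a different seed yields a different dispatch for the same demand, which is the kind of variation a fixed‑cost dataset cannot provide.
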
Data Packaging
- Results are exported as lightweight HDF5/Parquet files, together with metadata (seed, scaling factor, topology changes).
- A small helper class (DataKitLoader) streams batches directly into PyTorch or TensorFlow pipelines.

The entire pipeline is parallelized with Python’s concurrent.futures, so tens of thousands of samples for a 10 k‑bus system can be generated in under an hour on a 16‑core workstation.
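
A minimal sketch of this pattern with `concurrent.futures` is shown below. The worker `generate_sample` is a hypothetical placeholder for one load/topology draw plus a solve, and the per‑sample seeding scheme is an assumption. A thread pool is used so the snippet runs in any context; for CPU‑bound solves a `ProcessPoolExecutor` is the usual choice and exposes the same `map` interface.

```python
from concurrent.futures import ThreadPoolExecutor
import random

def generate_sample(seed):
    """Placeholder for one scenario draw plus a PF/OPF solve (hypothetical)."""
    rng = random.Random(seed)
    # Store the seed with the result so the sample can be regenerated later.
    return {"seed": seed, "scaling_factor": rng.uniform(0.7, 1.1)}

def generate_dataset(n_samples, base_seed=0, max_workers=4):
    """Generate samples in parallel, one independent seed per sample."""
    seeds = [base_seed + i for i in range(n_samples)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so sample i matches seeds[i].
        return list(pool.map(generate_sample, seeds))

dataset = generate_dataset(8)
print(len(dataset), dataset[0]["seed"])
```

Giving every sample its own seed, rather than sharing one random generator across workers, keeps the output independent of how tasks are scheduled.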

Results & Findings

Test System	#Samples	Avg. Generation Time (s)	% PF Samples Violating Limits
IEEE‑14	50 k	0.12	8 %
IEEE‑118	200 k	0.45	12 %
Synthetic 10 k‑bus	30 k	3.8	15 %

Diversity boost: Compared to OPFData and PFΔ, gridfm‑datakit’s datasets show a 2–3× larger spread in load levels and a 5–10× higher incidence of limit‑violating states.
Training impact: A simple feed‑forward NN trained on the new PF data achieved 94 % accuracy in predicting voltage violations, versus 78 % when trained on conventional (feasible‑only) datasets.
Scalability: Memory footprint grows linearly with bus count; the library stays under 8 GB RAM for the 10 k‑bus case, making it suitable for cloud‑based batch jobs.

These numbers suggest that the library produces richer data and that this richness translates into measurable gains for downstream ML models.

Practical Implications

Grid operators using ML‑based tools can train solvers that are robust to unexpected overloads, enabling faster “what‑if” analyses during emergencies.
Renewable integration studies benefit from realistic stochastic load and topology variations, improving the fidelity of scenario‑based planning tools.
Market simulation platforms can inject dynamic generator cost curves, allowing analysts to test pricing algorithms under a wider set of economic conditions.
Software vendors can embed gridfm‑datakit into their test suites to automatically generate regression datasets for new PF/OPF solvers, reducing manual data‑curation effort.
Educational tools gain a plug‑and‑play source of diverse cases, helping students explore contingency analysis without building custom data pipelines.

Limitations & Future Work

The current implementation relies on deterministic PF solvers; stochastic or probabilistic power‑flow methods are not yet supported.
While topology perturbations preserve connectivity by default, more sophisticated contingency models (e.g., N‑k‑m with islanding) are left to the user.
The library focuses on balanced, single‑phase networks; extending to unbalanced three‑phase distribution models is planned.
Future releases aim to integrate GPU‑accelerated PF solvers and to provide a benchmark suite that automatically evaluates ML model performance across generated datasets.

Authors

Alban Puech
Matteo Mazzonelli
Celia Cintas
Tamara R. Govindasamy
Mangaliso Mngomezulu
Jonas Weiss
Matteo Baù
Anna Varbella
François Mirallès
Kibaek Kim
Le Xie
Hendrik F. Hamann
Etienne Vos
Thomas Brunschwiler

Paper Information

arXiv ID: 2512.14658v1
Categories: cs.LG, cs.AI, eess.SY, math.OC
Published: December 16, 2025
