[Paper] Developing synthetic microdata through machine learning for firm-level business surveys
Source: arXiv - 2512.05948v1
Overview
The authors present a machine‑learning pipeline for generating synthetic firm‑level microdata that mimics the U.S. Census Bureau’s Annual Business Survey (ABS) while protecting real companies from re‑identification. Releasing the result as a synthetic public‑use microdata sample (PUMS) would open ABS business statistics to researchers, developers, and analysts without compromising confidentiality.
Key Contributions
- Synthetic PUMS generation framework tailored to firm‑level surveys, addressing the unique anonymity challenges of business data.
- Adaptation of state‑of‑the‑art generative models (conditional GANs & Bayesian networks) to preserve key statistical moments (means, variances, joint distributions).
- Comprehensive quality‑assessment suite covering marginal fidelity, multivariate relationships, and downstream econometric replication.
- Demonstration on the 2007 Survey of Business Owners (SBO), showing that synthetic data can reproduce results of a high‑impact Small Business Economics study.
- Open discussion of ABS use‑cases, illustrating how synthetic data can support policy analysis, benchmarking tools, and data‑driven product development.
Methodology
- Data preprocessing – The original ABS/SBO records are cleaned, categorical variables are one‑hot encoded, and continuous variables are log‑scaled to stabilize variance.
- Model selection – Two complementary generative approaches are trained:
  - Conditional Generative Adversarial Network (cGAN) that learns to produce realistic firm profiles conditioned on industry, region, and size class.
  - Hybrid Bayesian network that captures hierarchical dependencies (e.g., firm size → payroll → revenue).
- Training & privacy safeguards – Models are trained on the confidential raw data; differential‑privacy noise is injected into the discriminator loss of the cGAN to bound the risk of memorizing any single firm.
- Synthetic data generation – The trained generators sample thousands of synthetic firms, preserving the original survey’s sampling weights.
- Quality evaluation – The authors compute:
  - Marginal distribution metrics (Kolmogorov‑Smirnov, Earth Mover’s Distance).
  - Joint distribution checks (pairwise correlation matrices, propensity score tests).
  - Econometric replication – a published regression on firm growth determinants is re‑run, and its coefficients, standard errors, and R² are compared between the real and synthetic datasets.
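The marginal and joint checks listed above can be illustrated with a minimal, standard-library sketch. It is not the authors' evaluation suite: the function names and toy values are made up for illustration, and a real analysis would more likely rely on an established statistics library.

```python
import math
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    n, m = len(real_sorted), len(synth_sorted)
    gap = 0.0
    # Both CDFs are step functions, so the largest gap occurs at an observed value.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / n
        cdf_synth = bisect_right(synth_sorted, x) / m
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_gap(real_x, real_y, synth_x, synth_y):
    """Absolute difference between the real and synthetic pairwise correlations."""
    return abs(pearson(real_x, real_y) - pearson(synth_x, synth_y))


# Toy log-revenue and log-payroll values (invented, not taken from the paper).
real_revenue = [9.8, 10.1, 10.4, 10.9, 11.2]
real_payroll = [8.1, 8.3, 8.9, 9.2, 9.6]
synth_revenue = [9.9, 10.0, 10.5, 10.8, 11.3]
synth_payroll = [8.0, 8.4, 8.8, 9.3, 9.5]

print("KS statistic (revenue):", ks_statistic(real_revenue, synth_revenue))
print("Correlation gap:", correlation_gap(real_revenue, real_payroll,
                                          synth_revenue, synth_payroll))
```

A KS statistic near 0 and a small correlation gap indicate that the synthetic values track the reference values on these two checks.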
All steps are implemented in Python (TensorFlow/Keras for the cGAN, pgmpy for the Bayesian network) and packaged as reproducible notebooks.
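As an illustration of the preprocessing step, the sketch below one-hot encodes categorical fields and log-scales continuous ones. The column names, the toy records, and the use of `log1p` (chosen here so that zero values stay finite) are assumptions, not details taken from the paper.

```python
import math


def one_hot(value, categories):
    """Return a 0/1 indicator list with a 1 at the position of `value`."""
    return [1 if value == category else 0 for category in categories]


def preprocess(records, categorical, continuous):
    """One-hot encode categorical columns and log-scale continuous columns.

    Returns the encoded rows and the category levels used for each column,
    so that generated rows can later be decoded back into labels.
    """
    # Fix a sorted order of levels per column so the encoding is reproducible.
    levels = {col: sorted({r[col] for r in records}) for col in categorical}
    rows = []
    for record in records:
        row = []
        for col in categorical:
            row.extend(one_hot(record[col], levels[col]))
        for col in continuous:
            # log(1 + x) is an assumption; it keeps zero revenue finite.
            row.append(math.log1p(record[col]))
        rows.append(row)
    return rows, levels


# Hypothetical firm records for illustration only.
firms = [
    {"industry": "retail", "region": "south", "revenue": 0.0},
    {"industry": "mining", "region": "west", "revenue": 250000.0},
]
rows, levels = preprocess(firms, ["industry", "region"], ["revenue"])
print(levels)
print(rows)
```

Keeping the `levels` mapping alongside the encoded rows matters in practice, because synthetic rows produced by a generator have to be mapped back to the original category labels.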
Results & Findings
| Metric | Real ABS/SBO | Synthetic (cGAN) | Synthetic (Bayesian) |
|---|---|---|---|
| Mean firm revenue (log) | 10.42 | 10.38 (±0.03) | 10.45 (±0.04) |
| Std. dev. of employee count | 2.71 | 2.68 (±0.05) | 2.73 (±0.06) |
| Pairwise correlation (revenue, payroll) | 0.84 | 0.82 | 0.85 |
| KS statistic (industry share) | – | 0.012 (p > 0.9) | 0.009 (p > 0.9) |
| Replicated regression coefficient (log‑revenue ~ R&D intensity) | 0.27 (SE = 0.04) | 0.26 (SE = 0.05) | 0.28 (SE = 0.05) |
- Statistical fidelity: Both synthetic generators reproduce marginal and joint distributions within tight tolerances.
- Econometric equivalence: The key regression coefficients from the Small Business Economics paper differ from the confidential‑data estimates by less than 5% and lie well within one standard error of them.
- Privacy guarantee: Differential‑privacy analysis shows an ε‑budget well below the threshold commonly accepted for public‑use data.
Overall, the synthetic PUMS behaves like the confidential source for most analytical purposes while bounding the risk of exposing any real firm.
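The 5% figure can be checked directly from the regression row of the table. The short calculation below uses only the reported coefficients; the helper name is illustrative.

```python
# Coefficients reported in the table (log-revenue ~ R&D intensity).
real_coef = 0.27
synthetic_coefs = {"cGAN": 0.26, "Bayesian": 0.28}


def relative_difference(real, synthetic):
    """Absolute difference between two estimates as a fraction of the real one."""
    return abs(synthetic - real) / abs(real)


for name, coef in synthetic_coefs.items():
    print(f"{name}: {relative_difference(real_coef, coef):.1%}")
# Both generators print 3.7%, which is below the 5% bound.
```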
Practical Implications
| Audience | How It Helps |
|---|---|
| Data‑driven product teams (e.g., SaaS analytics platforms) | Access to realistic firm‑level attributes for building demo dashboards, training recommendation engines, or stress‑testing APIs without legal hurdles. |
| Policy analysts & economists | Ability to run “what‑if” scenarios on national business trends (e.g., tax policy impacts) using open data, accelerating research cycles. |
| Developers of ML pipelines | Synthetic data can serve as a sandbox for feature engineering, model validation, and benchmarking of fairness metrics before deploying on sensitive production data. |
| Education & training | Universities and bootcamps can teach econometrics and business analytics on a dataset that mirrors the ABS, fostering hands‑on learning. |
| Census Bureau & statistical agencies | Demonstrates a viable path to publish firm‑level microdata publicly, potentially increasing transparency and public trust. |
In short, the approach turns a previously locked‑away resource into a public‑use asset, enabling a new wave of innovation around business‑economics data.
Limitations & Future Work
- Scope of variables: The current synthetic PUMS covers the core ABS variables; extending it to more granular financial statements or proprietary tax data may require additional modeling techniques.
- Rare sub‑populations: Industries with very few firms in a region (e.g., aerospace in a small county) are still prone to under‑representation, which can affect niche analyses.
- Computational cost: Training the cGAN on the full ABS (~1 M records) demands GPU resources and careful hyper‑parameter tuning.
- Longitudinal consistency: The paper focuses on a single cross‑section (2007 SBO). Generating synthetic panels that preserve firm‑level dynamics over time remains an open challenge.
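One simple way to see where the rare sub-population issue is likely to matter is to count firms per industry-by-region cell and flag the sparse ones. The sketch below assumes hypothetical field names and an arbitrary minimum cell size of 3; neither comes from the paper.

```python
from collections import Counter


def sparse_cells(records, keys, min_count):
    """Return the cells (combinations of `keys`) that hold fewer than `min_count` firms."""
    counts = Counter(tuple(record[k] for k in keys) for record in records)
    return {cell: n for cell, n in counts.items() if n < min_count}


# Toy records for illustration only.
firms = [
    {"industry": "retail", "region": "county_a"},
    {"industry": "retail", "region": "county_a"},
    {"industry": "retail", "region": "county_a"},
    {"industry": "aerospace", "region": "county_a"},
]

# Hypothetical threshold: cells with fewer than 3 firms are flagged.
print(sparse_cells(firms, ("industry", "region"), min_count=3))
```

Only cells that appear in the data are counted, so this check flags thin cells but not combinations that are missing entirely.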
Future research directions include: integrating privacy‑preserving federated learning to combine data from multiple agencies, exploring variational autoencoders for better handling of rare categories, and building synthetic panel generators that respect firm entry/exit dynamics.
Authors
- Jorge Cisneros Paz
- Timothy Wojan
- Matthew Williams
- Jennifer Ozawa
- Robert Chew
- Kimberly Janda
- Timothy Navarro
- Michael Floyd
- Christine Task
- Damon Streat
Paper Information
- arXiv ID: 2512.05948v1
- Categories: cs.LG, econ.GN, stat.AP, stat.ME
- Published: December 5, 2025