[Paper] Developing synthetic microdata through machine learning for firm-level business surveys

Published: December 5, 2025 at 01:44 PM EST
Source: arXiv - 2512.05948v1

Overview

The authors present a machine‑learning pipeline for generating synthetic firm‑level microdata that mimics the United States Census Bureau’s Annual Business Survey (ABS) while guaranteeing that no real company can be re‑identified. By turning the ABS into a public‑use microdata sample (PUMS), the work opens up a wealth of business‑statistics data to researchers, developers, and analysts without compromising confidentiality.

Key Contributions

  • Synthetic PUMS generation framework tailored to firm‑level surveys, addressing the unique anonymity challenges of business data.
  • Adaptation of state‑of‑the‑art generative models (conditional GANs & Bayesian networks) to preserve key statistical moments (means, variances, joint distributions).
  • Comprehensive quality‑assessment suite covering marginal fidelity, multivariate relationships, and downstream econometric replication.
  • Demonstration on the 2007 Survey of Business Owners (SBO), showing that synthetic data can reproduce results of a high‑impact Small Business Economics study.
  • Open discussion of ABS use‑cases, illustrating how synthetic data can support policy analysis, benchmarking tools, and data‑driven product development.

Methodology

  1. Data preprocessing – The original ABS/SBO records are cleaned, categorical variables are one‑hot encoded, and continuous variables are log‑scaled to stabilize variance.
  2. Model selection – Two complementary generative approaches are trained:
    • Conditional Generative Adversarial Network (cGAN) that learns to produce realistic firm profiles conditioned on industry, region, and size class.
    • Hybrid Bayesian network that captures hierarchical dependencies (e.g., firm size → payroll → revenue).
  3. Training & privacy safeguards – Models are trained on the confidential raw data; differential‑privacy noise is injected into the discriminator loss of the cGAN to bound the risk of memorizing any single firm.
  4. Synthetic data generation – The trained generators sample thousands of synthetic firms, preserving the original survey’s sampling weights.
  5. Quality evaluation – The authors compute:
    • Marginal distribution metrics (Kolmogorov‑Smirnov, Earth Mover’s Distance).
    • Joint distribution checks (pairwise correlation matrices, propensity score tests).
    • Econometric replication – Re‑run a published regression on firm growth determinants and compare coefficients, standard errors, and R² between real and synthetic datasets.
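
The marginal metrics in step 5 can be illustrated with a minimal, standard-library sketch. It is not the authors' code: the function names and the toy log-revenue values are invented for illustration, and the Earth Mover's Distance shortcut assumes both samples have the same size. In practice, library routines such as SciPy's `ks_2samp` and `wasserstein_distance` would normally be used instead.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(real), sorted(synthetic)
    # Both empirical CDFs are step functions, so the largest gap
    # is attained at one of the observed values.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def emd_equal_size(real, synthetic):
    """1-D Earth Mover's Distance for equal-size samples:
    the mean absolute difference between sorted values."""
    if len(real) != len(synthetic):
        raise ValueError("this shortcut assumes equal sample sizes")
    pairs = zip(sorted(real), sorted(synthetic))
    return sum(abs(x - y) for x, y in pairs) / len(real)


# Toy log-revenue values, invented for illustration only.
real_log_rev = [9.8, 10.1, 10.4, 10.6, 11.2]
synth_log_rev = [9.9, 10.0, 10.5, 10.7, 11.0]

print(ks_statistic(real_log_rev, synth_log_rev))
print(emd_equal_size(real_log_rev, synth_log_rev))
```

Both metrics are zero when the two samples coincide and grow as the synthetic marginal drifts away from the real one, which is why small values are read as evidence of marginal fidelity.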

All steps are implemented in Python (TensorFlow/Keras for the cGAN, pgmpy for the Bayesian network) and packaged as reproducible notebooks.
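
The privacy safeguard in step 3 is described only at a high level, so the following is a sketch of one common way to add differential-privacy noise during training, not the paper's confirmed mechanism. It uses the clip-and-noise recipe from DP-SGD: each example's gradient is clipped to a fixed L2 norm, the clipped gradients are summed, Gaussian noise is added, and the result is averaged. The function names, the gradient values, and the `max_norm` and `noise_multiplier` settings are all hypothetical.

```python
import math
import random


def clip_to_norm(grad, max_norm):
    """Scale a gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_to_norm(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    # The noise standard deviation scales with the clipping bound, because
    # that bound limits how much any single example can change the sum.
    sigma = noise_multiplier * max_norm
    noisy_sum = [
        sum(g[i] for g in clipped) + rng.gauss(0.0, sigma) for i in range(dim)
    ]
    return [s / len(clipped) for s in noisy_sum]


# Hypothetical per-firm discriminator gradients (two parameters each).
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
rng = random.Random(0)  # seeded only to make the illustration repeatable
print(noisy_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping is what makes the noise meaningful: without a bound on each firm's contribution, no finite amount of noise could mask a single unusually large record.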

Results & Findings

| Metric | Real ABS/SBO | Synthetic (cGAN) | Synthetic (Bayesian) |
|---|---|---|---|
| Mean firm revenue (log) | 10.42 | 10.38 (±0.03) | 10.45 (±0.04) |
| Std. dev. of employee count | 2.71 | 2.68 (±0.05) | 2.73 (±0.06) |
| Pairwise correlation (revenue, payroll) | 0.84 | 0.82 | 0.85 |
| KS‑test (industry share) | n/a | 0.012 (p > 0.9) | 0.009 (p > 0.9) |
| Replicated regression coefficient (log‑revenue ~ R&D intensity) | 0.27 (SE = 0.04) | 0.26 (SE = 0.05) | 0.28 (SE = 0.05) |

  • Statistical fidelity: Both synthetic generators reproduce marginal and joint distributions within tight tolerances.
  • Econometric equivalence: The key regression coefficients from the Small Business Economics paper differ by less than 5 % between the real and synthetic datasets.
  • Privacy guarantee: Differential‑privacy analysis shows an ε‑budget well below the threshold commonly accepted for public‑use data.
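
The "less than 5 %" figure can be checked against the regression row of the results table. The sketch below computes the relative difference of each synthetic coefficient from the real-data one, plus a rough z statistic for the gap. The z statistic assumes the two estimates are independent, which is a simplification here since the synthetic data are derived from the real data. The helper names are hypothetical.

```python
import math

# Coefficients and standard errors as reported in the results table.
real = (0.27, 0.04)
synthetic = {"cGAN": (0.26, 0.05), "Bayesian": (0.28, 0.05)}


def relative_difference(real_coef, synth_coef):
    """Absolute difference as a fraction of the real-data coefficient."""
    return abs(synth_coef - real_coef) / abs(real_coef)


def z_score(real_est, synth_est):
    """z statistic for the difference of two estimates, assuming independence."""
    (b1, se1), (b2, se2) = real_est, synth_est
    return (b2 - b1) / math.sqrt(se1**2 + se2**2)


for name, est in synthetic.items():
    rel = relative_difference(real[0], est[0])
    print(f"{name}: relative diff = {rel:.1%}, z = {z_score(real, est):+.2f}")
```

Both relative differences come to about 3.7 %, and both z statistics are far inside the usual ±1.96 band, consistent with the claim that the synthetic estimates track the real one closely.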

Overall, the synthetic PUMS behaves like the confidential source for most analytical purposes while eliminating the risk of exposing any real firm.

Practical Implications

| Audience | How It Helps |
|---|---|
| Data‑driven product teams (e.g., SaaS analytics platforms) | Access to realistic firm‑level attributes for building demo dashboards, training recommendation engines, or stress‑testing APIs without legal hurdles. |
| Policy analysts & economists | Ability to run “what‑if” scenarios on national business trends (e.g., tax policy impacts) using open data, accelerating research cycles. |
| Developers of ML pipelines | Synthetic data can serve as a sandbox for feature engineering, model validation, and benchmarking of fairness metrics before deploying on sensitive production data. |
| Education & training | Universities and bootcamps can teach econometrics and business analytics on a dataset that mirrors the ABS, fostering hands‑on learning. |
| Census Bureau & statistical agencies | Demonstrates a viable path to publish firm‑level microdata publicly, potentially increasing transparency and public trust. |

In short, the approach turns a previously locked‑away resource into a public‑use asset, enabling a new wave of innovation around business‑economics data.

Limitations & Future Work

  • Scope of variables: The current synthetic PUMS covers the core ABS variables; extending to more granular financial statements or proprietary tax data may require additional modeling tricks.
  • Rare sub‑populations: Industries with very few firms in a region (e.g., aerospace in a small county) are still prone to under‑representation, which can affect niche analyses.
  • Computational cost: Training the cGAN on the full ABS (~1 M records) demands GPU resources and careful hyper‑parameter tuning.
  • Longitudinal consistency: The paper focuses on a single cross‑section (2007 SBO). Generating synthetic panels that preserve firm‑level dynamics over time remains an open challenge.
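
The rare sub-population issue can be made concrete with a simple cell-count check. The sketch below tallies records per industry-by-region cell and flags the cells that fall under a minimum count. The field names, the toy records, and the threshold of 3 are all hypothetical choices for illustration, not values from the paper.

```python
from collections import Counter


def sparse_cells(records, keys, min_count):
    """Return a dict mapping each sparse cell to its record count,
    where a cell is sparse if it holds fewer than min_count records."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return {cell: n for cell, n in counts.items() if n < min_count}


# Toy records; the field names and values are hypothetical.
firms = [
    {"industry": "retail", "region": "A"},
    {"industry": "retail", "region": "A"},
    {"industry": "retail", "region": "A"},
    {"industry": "aerospace", "region": "A"},
    {"industry": "retail", "region": "B"},
    {"industry": "retail", "region": "B"},
]

print(sparse_cells(firms, keys=("industry", "region"), min_count=3))
```

Running the same check on both the confidential and the synthetic data would show which cells are sparse to begin with and which are further thinned out by the generator.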

Future research directions include integrating privacy‑preserving federated learning to combine data from multiple agencies, exploring variational autoencoders for better handling of rare categories, and building synthetic panel generators that respect firm entry/exit dynamics.

Authors

  • Jorge Cisneros Paz
  • Timothy Wojan
  • Matthew Williams
  • Jennifer Ozawa
  • Robert Chew
  • Kimberly Janda
  • Timothy Navarro
  • Michael Floyd
  • Christine Task
  • Damon Streat

Paper Information

  • arXiv ID: 2512.05948v1
  • Categories: cs.LG, econ.GN, stat.AP, stat.ME
  • Published: December 5, 2025