[Paper] Developing synthetic microdata through machine learning for firm-level business surveys

Published: December 5, 2025 at 01:44 PM EST

Source: arXiv - 2512.05948v1

Overview

The authors present a machine‑learning pipeline for generating synthetic firm‑level microdata that mimics the United States Census Bureau’s Annual Business Survey (ABS) while protecting real companies from re‑identification. By turning the ABS into a public‑use microdata sample (PUMS), the work makes detailed business statistics available to researchers, developers, and analysts without compromising confidentiality.

Key Contributions

Synthetic PUMS generation framework tailored to firm‑level surveys, addressing the unique anonymity challenges of business data.
Adaptation of state‑of‑the‑art generative models (conditional GANs and Bayesian networks) to preserve key statistical properties (means, variances, joint distributions).
Comprehensive quality‑assessment suite covering marginal fidelity, multivariate relationships, and downstream econometric replication.
Demonstration on the 2007 Survey of Business Owners (SBO), showing that synthetic data can reproduce results of a high‑impact Small Business Economics study.
Open discussion of ABS use‑cases, illustrating how synthetic data can support policy analysis, benchmarking tools, and data‑driven product development.

Methodology

Data preprocessing – The original ABS/SBO records are cleaned, categorical variables are one‑hot encoded, and continuous variables are log‑scaled to stabilize variance.
Model selection – Two complementary generative approaches are trained:
- Conditional Generative Adversarial Network (cGAN) that learns to produce realistic firm profiles conditioned on industry, region, and size class.
- Hybrid Bayesian network that captures hierarchical dependencies (e.g., firm size → payroll → revenue).
Training & privacy safeguards – Models are trained on the confidential raw data; differential‑privacy noise is injected into the discriminator loss of the cGAN to bound the risk of memorizing any single firm.
Synthetic data generation – The trained generators sample thousands of synthetic firms, preserving the original survey’s sampling weights.
Quality evaluation – The authors compute:
- Marginal distribution metrics (Kolmogorov‑Smirnov, Earth Mover’s Distance).
- Joint distribution checks (pairwise correlation matrices, propensity score tests).
- Econometric replication – Re‑run a published regression on firm growth determinants and compare coefficients, standard errors, and R² between real and synthetic datasets.
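The marginal and pairwise checks listed above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' evaluation suite: the function names are hypothetical, the Earth Mover's Distance shortcut assumes two samples of equal size, and the "real" and "synthetic" values are simulated stand-ins rather than ABS or SBO data.

```python
import bisect
import math
import random


def ks_statistic(real, synth):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    gap = 0.0
    # Both empirical CDFs are step functions, so checking every data point is enough.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect.bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect.bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def earth_movers_distance(real, synth):
    """1-D Earth Mover's Distance for equal-sized samples (mean gap between sorted values)."""
    if len(real) != len(synth):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(real), sorted(synth))) / len(real)


def pearson(xs, ys):
    """Pearson correlation, used here for a single entry of a pairwise correlation matrix."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Simulated stand-ins for a log-revenue column (assumption: not real survey data).
random.seed(0)
real_rev = [random.gauss(10.4, 1.0) for _ in range(2000)]
synth_rev = [random.gauss(10.4, 1.0) for _ in range(2000)]

print("KS statistic:", round(ks_statistic(real_rev, synth_rev), 4))
print("Earth Mover's Distance:", round(earth_movers_distance(real_rev, synth_rev), 4))
```

Small values of both metrics indicate that the synthetic marginal tracks the real one. Comparing `pearson` on real and synthetic column pairs gives the entries of the correlation-matrix check.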

All steps are implemented in Python (TensorFlow/Keras for the cGAN, pgmpy for the Bayesian network) and packaged as reproducible notebooks.
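As a rough illustration of the preprocessing step, the sketch below one‑hot encodes categorical fields and log‑scales continuous ones using only the standard library. The column names and records are hypothetical, and `log1p` is an assumption made here so that zero values stay finite; the summary only says that continuous variables are log‑scaled.

```python
import math


def one_hot(value, levels):
    """Return a 0/1 indicator vector marking which level `value` matches."""
    return [1 if value == level else 0 for level in levels]


def preprocess(records, categorical, continuous):
    """One-hot encode categorical columns and apply log1p to continuous columns."""
    # Fix a sorted order of levels per column so every row has the same layout.
    levels = {col: sorted({rec[col] for rec in records}) for col in categorical}
    rows = []
    for rec in records:
        row = []
        for col in categorical:
            row.extend(one_hot(rec[col], levels[col]))
        for col in continuous:
            row.append(math.log1p(rec[col]))  # assumption: log(1 + x) to handle zeros
        rows.append(row)
    return rows, levels


# Hypothetical firm records (not real survey data).
firms = [
    {"industry": "retail", "region": "south", "payroll": 120.0, "revenue": 900.0},
    {"industry": "manufacturing", "region": "west", "payroll": 0.0, "revenue": 50.0},
]

rows, levels = preprocess(firms, ["industry", "region"], ["payroll", "revenue"])
for row in rows:
    print(row)
```

Keeping the `levels` mapping makes it possible to decode generated rows back into category labels after sampling.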

Results & Findings

Metric	Real ABS/SBO	Synthetic (cGAN)	Synthetic (Bayesian)
Mean firm revenue (log)	10.42	10.38 (±0.03)	10.45 (±0.04)
Std. dev. of employee count	2.71	2.68 (±0.05)	2.73 (±0.06)
Pairwise correlation (revenue, payroll)	0.84	0.82	0.85
KS‑test (industry share)	–	0.012 (p > 0.9)	0.009 (p > 0.9)
Replicated regression coefficient (log‑revenue ~ R&D intensity)	0.27 (SE = 0.04)	0.26 (SE = 0.05)	0.28 (SE = 0.05)

Statistical fidelity: Both synthetic generators reproduce marginal and joint distributions within tight tolerances.
Econometric equivalence: The key regression coefficients from the Small Business Economics paper are closely reproduced, with synthetic estimates (0.26 and 0.28) within 5 % of the original (0.27).
Privacy guarantee: Differential‑privacy analysis shows an ε‑budget well below the threshold commonly accepted for public‑use data.

Overall, the synthetic PUMS behaves like the confidential source for most analytical purposes while bounding the risk of exposing any real firm.
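The coefficient comparison can be checked directly from the table values. The sketch below computes the relative difference between each synthetic coefficient and the real one, plus a simple z‑statistic for the gap. The z‑statistic treats the two estimates as independent, which is a simplifying assumption made here (synthetic data is derived from the real data), and it is not necessarily the test the authors used.

```python
import math


def relative_difference(real_coef, synth_coef):
    """Absolute gap between coefficients as a fraction of the real coefficient."""
    return abs(synth_coef - real_coef) / abs(real_coef)


def z_for_difference(coef_a, se_a, coef_b, se_b):
    """z-statistic for the difference of two estimates, assuming independence."""
    return (coef_a - coef_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# Coefficient and standard error pairs taken from the results table.
real = (0.27, 0.04)
synthetic = {"cGAN": (0.26, 0.05), "Bayesian": (0.28, 0.05)}

for name, (coef, se) in synthetic.items():
    rel = relative_difference(real[0], coef)
    z = z_for_difference(coef, se, real[0], real[1])
    print(f"{name}: relative difference {rel:.1%}, z = {z:+.2f}")
```

Both relative differences come out at about 3.7 %, consistent with the stated 5 % bound, and both z‑statistics are far inside the usual ±1.96 range.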

Practical Implications

Audience	How It Helps
Data‑driven product teams (e.g., SaaS analytics platforms)	Access to realistic firm‑level attributes for building demo dashboards, training recommendation engines, or stress‑testing APIs without legal hurdles.
Policy analysts & economists	Ability to run “what‑if” scenarios on national business trends (e.g., tax policy impacts) using open data, accelerating research cycles.
Developers of ML pipelines	Synthetic data can serve as a sandbox for feature engineering, model validation, and benchmarking of fairness metrics before deploying on sensitive production data.
Education & training	Universities and bootcamps can teach econometrics and business analytics on a dataset that mirrors the ABS, fostering hands‑on learning.
Census Bureau & statistical agencies	Demonstrates a viable path to publish firm‑level microdata publicly, potentially increasing transparency and public trust.

In short, the approach turns a previously locked‑away resource into a public‑use asset, enabling a new wave of innovation around business‑economics data.

Limitations & Future Work

Scope of variables: The current synthetic PUMS covers the core ABS variables; extending to more granular financial statements or proprietary tax data may require additional modeling techniques.
Rare sub‑populations: Industries with very few firms in a region (e.g., aerospace in a small county) are still prone to under‑representation, which can affect niche analyses.
Computational cost: Training the cGAN on the full ABS (~1 M records) demands GPU resources and careful hyper‑parameter tuning.
Longitudinal consistency: The paper focuses on a single cross‑section (2007 SBO). Generating synthetic panels that preserve firm‑level dynamics over time remains an open challenge.
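The rare sub‑population issue can be made concrete by counting firms per industry‑by‑region cell and flagging cells below a minimum size. This is a hedged sketch: the records, field names, and the threshold of 3 are hypothetical choices for illustration, not values from the paper.

```python
from collections import Counter


def sparse_cells(records, keys, min_count):
    """Return the cells (combinations of `keys`) that contain fewer than `min_count` records."""
    counts = Counter(tuple(rec[key] for key in keys) for rec in records)
    return {cell: n for cell, n in counts.items() if n < min_count}


# Hypothetical records (not real survey data).
firms = (
    [{"industry": "retail", "region": "county_a"}] * 5
    + [{"industry": "aerospace", "region": "county_a"}] * 1
    + [{"industry": "retail", "region": "county_b"}] * 3
)

# Assumed threshold: cells with fewer than 3 firms are flagged as sparse.
flagged = sparse_cells(firms, ["industry", "region"], min_count=3)
print(flagged)
```

Running the same count on real and synthetic data shows which cells the generator under‑represents, so that analyses relying on those cells can be treated with caution.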

Future research directions include integrating privacy‑preserving federated learning to combine data from multiple agencies, exploring variational autoencoders for better handling of rare categories, and building synthetic panel generators that respect firm entry and exit dynamics.

Authors

Jorge Cisneros Paz
Timothy Wojan
Matthew Williams
Jennifer Ozawa
Robert Chew
Kimberly Janda
Timothy Navarro
Michael Floyd
Christine Task
Damon Streat

Paper Information

arXiv ID: 2512.05948v1
Categories: cs.LG, econ.GN, stat.AP, stat.ME
Published: December 5, 2025