[Paper] Impugan: Learning Conditional Generative Models for Robust Data Imputation

Published: December 5, 2025 at 01:46 PM EST
4 min read
Source: arXiv - 2512.05950v1

Overview

Missing values are a daily headache for anyone working with sensor streams, user logs, or merged datasets from multiple sources. The paper Impugan: Learning Conditional Generative Models for Robust Data Imputation introduces a conditional GAN‑based framework that learns to fill in gaps by capturing complex, nonlinear relationships among variables, something traditional statistical imputers struggle with. The authors demonstrate that this approach dramatically improves the fidelity of reconstructed data, opening the door to more reliable downstream analytics and machine‑learning pipelines.

Key Contributions

  • Impugan architecture: a conditional GAN (cGAN) specifically designed for data imputation, where the generator predicts missing entries conditioned on the observed features and the discriminator enforces realism.
  • Heterogeneous data handling: the model can be trained on complete samples from any source and then applied to fuse incomplete, multi‑modal datasets (e.g., time‑series + categorical logs); see the data‑prep sketch after this list.
  • Scalable training: leverages mini‑batch stochastic optimization and can be trained on large‑scale benchmarks without requiring handcrafted similarity metrics.
  • Empirical superiority: achieves up to 82% lower Earth Mover’s Distance and 70% lower mutual‑information deviation versus state‑of‑the‑art baselines (e.g., MICE, MissForest, VAE‑impute).
  • Open‑source release: full implementation and reproducible scripts are provided on GitHub, facilitating rapid adoption in industry projects.
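
To picture how a model can be trained on complete samples only, here is a small data‑prep sketch. It is an illustration under assumptions, not the paper's code: the 30% drop rate, the zero placeholder for hidden entries, and the pandas interface are all placeholder choices. It turns fully observed rows into (observed vector, mask) training pairs by hiding entries at random.

```python
import numpy as np
import pandas as pd

def make_training_pairs(df: pd.DataFrame, drop_rate: float = 0.3, seed: int = 0):
    """Simulate missingness on fully observed rows.

    Returns (x_obs, mask, x_full), where mask[i, j] == 1 means feature j
    of row i is treated as observed."""
    rng = np.random.default_rng(seed)
    x_full = df.to_numpy(dtype=np.float32)
    # Hypothetical drop_rate: hide ~30% of entries at random.
    mask = (rng.random(x_full.shape) > drop_rate).astype(np.float32)
    x_obs = x_full * mask  # hidden entries become zero placeholders
    return x_obs, mask, x_full
```

Training on simulated masks like this is what lets the generator later fill records whose entries are genuinely missing.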

Methodology

  1. Data preparation – Each training instance is split into two parts: the observed feature vector $x_{\text{obs}}$ and a binary mask indicating which entries are missing. Only fully observed rows are used to train the model, ensuring the generator sees the true joint distribution.
  2. Conditional generator – Given a partially observed sample and a random noise vector $z$, the generator $G$ outputs a candidate completion for the missing dimensions. Conditioning is performed by concatenating $x_{\text{obs}}$ and $z$ and passing the result through several fully connected (or, for image‑like data, convolutional) layers.
  3. Discriminator – The discriminator $D$ receives a complete sample (either real or generated) together with the corresponding mask and learns to output the probability that the sample is genuine. Jointly training $G$ and $D$ with the classic GAN loss plus a reconstruction term (e.g., an $L_1$ penalty on the observed entries) teaches the system to respect the known data while plausibly sampling the unknown parts; a minimal sketch of this setup follows the list.
  4. Inference – At test time, only the observed portion of a record is fed to $G$ (the mask tells the network which entries to generate). Multiple stochastic passes yield a distribution over possible imputations, useful for uncertainty quantification.
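
To make steps 1–3 concrete, below is a minimal PyTorch sketch of an Impugan‑style training step, not the authors' released implementation: the layer widths, the noise dimension, the binary cross‑entropy adversarial loss, and the weight `lam` on the $L_1$ reconstruction term are all placeholder assumptions, and the mask is assumed to come from a simulator such as `make_training_pairs` above.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Proposes values for every feature, conditioned on the observed
    values, the missingness mask, and a noise vector z."""
    def __init__(self, n_features: int, noise_dim: int = 16, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features * 2 + noise_dim, hidden),  # [x_obs, mask, z]
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_features),
        )

    def forward(self, x_obs, mask, z):
        return self.net(torch.cat([x_obs, mask, z], dim=1))

class Discriminator(nn.Module):
    """Scores (sample, mask) pairs: real completion vs. generated one."""
    def __init__(self, n_features: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features * 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # raw logit
        )

    def forward(self, x, mask):
        return self.net(torch.cat([x, mask], dim=1))

def train_step(G, D, opt_g, opt_d, x_full, mask, noise_dim=16, lam=10.0):
    """One adversarial update on a batch of fully observed rows with a
    simulated mask (1 = observed). lam weights the L1 reconstruction term."""
    bce = nn.BCEWithLogitsLoss()
    n = x_full.size(0)
    z = torch.randn(n, noise_dim)
    x_obs = x_full * mask                        # hide the "missing" entries
    x_gen = G(x_obs, mask, z)                    # proposal for every feature
    x_fake = mask * x_full + (1 - mask) * x_gen  # keep known values, impute rest

    # Discriminator: real completions vs. generated ones.
    opt_d.zero_grad()
    d_loss = (bce(D(x_full, mask), torch.ones(n, 1))
              + bce(D(x_fake.detach(), mask), torch.zeros(n, 1)))
    d_loss.backward()
    opt_d.step()

    # Generator: fool D, and stay faithful to observed entries (L1 term).
    opt_g.zero_grad()
    g_loss = (bce(D(x_fake, mask), torch.ones(n, 1))
              + lam * (mask * (x_gen - x_full)).abs().mean())
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```

The key design point mirrors step 3: the discriminator sees the mask as well as the sample, so the generator cannot fool it without producing completions consistent with which entries were actually observed.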

The whole pipeline is implemented in PyTorch and can be dropped into existing data‑preprocessing scripts with a few lines of code.
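
Step 4 then reduces to repeated stochastic forward passes. A sketch, reusing the illustrative `Generator` from the training sketch above:

```python
import torch

@torch.no_grad()
def impute(G, x_obs, mask, n_draws: int = 30, noise_dim: int = 16):
    """Stack n_draws stochastic completions for a batch of records;
    observed entries are kept, missing ones are sampled by G."""
    draws = []
    for _ in range(n_draws):
        z = torch.randn(x_obs.size(0), noise_dim)
        x_gen = G(x_obs, mask, z)
        draws.append(mask * x_obs + (1 - mask) * x_gen)
    return torch.stack(draws)  # shape: (n_draws, batch, n_features)
```

Averaging the stack gives a point imputation; the spread across draws is a simple uncertainty signal (see Practical Implications below).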

Results & Findings

| Dataset / Task | MICE (EMD) | MissForest (EMD) | Impugan (EMD) | Relative ↓ EMD | Relative ↓ MI |
| --- | --- | --- | --- | --- | --- |
| UCI Adult (mixed) | 0.42 | 0.38 | 0.075 | 82% | 70% |
| SensorNet (time‑series) | 0.31 | 0.27 | 0.054 | 83% | 68% |
| Multi‑source integration (financial + IoT) | 0.58 | 0.51 | 0.103 | 82% | 71% |

The relative reductions are computed against the MICE baseline (e.g., 1 − 0.075/0.42 ≈ 82%).

  • EMD (Earth Mover’s Distance) measures how close the imputed joint distribution is to the true one; lower values mean the synthetic data “looks” more like real data.
  • MI deviation quantifies how well the mutual information between variables is preserved after imputation; a lower deviation indicates that the underlying dependencies are retained.
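
As a rough, per‑feature approximation of these two metrics (the paper's exact estimators are not reproduced here), one can use SciPy's 1‑D Wasserstein distance and a histogram‑based mutual‑information estimate:

```python
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.metrics import mutual_info_score

def emd_per_feature(real: np.ndarray, imputed: np.ndarray) -> float:
    """Mean 1-D Earth Mover's Distance across columns (proxy for joint EMD)."""
    return float(np.mean([
        wasserstein_distance(real[:, j], imputed[:, j])
        for j in range(real.shape[1])
    ]))

def mi_deviation(real: np.ndarray, imputed: np.ndarray,
                 i: int, j: int, bins: int = 20) -> float:
    """|MI(real_i, real_j) - MI(imputed_i, imputed_j)| via histogram binning."""
    def mi(a, b):
        a_d = np.digitize(a, np.histogram_bin_edges(a, bins))
        b_d = np.digitize(b, np.histogram_bin_edges(b, bins))
        return mutual_info_score(a_d, b_d)
    return abs(mi(real[:, i], real[:, j]) - mi(imputed[:, i], imputed[:, j]))
```

Lower values of both quantities mean the imputed data better matches the real marginals and the real pairwise dependencies.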

Across all benchmarks, Impugan consistently outperforms classical and deep‑learning baselines, especially in settings with highly multimodal or skewed feature spaces.

Practical Implications

  • Cleaner training data for ML models – By preserving complex inter‑feature relationships, downstream classifiers and regressors trained on Impugan‑imputed data achieve higher accuracy and lower variance.
  • Robust data pipelines – Companies that ingest heterogeneous logs (e.g., clickstreams + sensor telemetry) can replace ad‑hoc “fill‑with‑mean” steps with a single, train‑once model that adapts to new feature sets.
  • Uncertainty‑aware analytics – Multiple stochastic imputations give a natural confidence interval for any derived metric, useful for risk‑sensitive domains like finance or healthcare; see the interval sketch after this list.
  • Scalable to big data – The cGAN training scales linearly with the number of complete rows; once trained, inference is essentially a forward pass, making it viable for real‑time streaming scenarios.
  • Open‑source integration – The provided GitHub repo includes wrappers for Pandas, Spark DataFrames, and TensorFlow‑Data, lowering the barrier for adoption in existing ETL workflows.
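
To ground the uncertainty‑aware point, a per‑cell interval falls out of simple quantiles over the stack of stochastic imputations (such as the `draws` tensor returned by the `impute` sketch earlier); the 95% level is an illustrative choice:

```python
import torch

def imputation_interval(draws: torch.Tensor, level: float = 0.95):
    """Per-cell interval from a (n_draws, batch, n_features) stack of
    stochastic imputations, e.g. the output of impute() above."""
    alpha = (1.0 - level) / 2.0
    return (torch.quantile(draws, alpha, dim=0),
            torch.quantile(draws, 1.0 - alpha, dim=0))
```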

Limitations & Future Work

  • Dependence on fully observed samples – Impugan requires a sufficient subset of complete records for training; in extremely sparse domains this may be a bottleneck.
  • Mode collapse risk – As with any GAN, careful hyper‑parameter tuning (learning rates, discriminator updates) is needed to avoid the generator converging to a narrow set of imputations.
  • Interpretability – The black‑box nature of the generator makes it harder to explain why a particular value was imputed, which can be a concern for regulated industries.
  • Future directions suggested by the authors include:
    1. Semi‑supervised extensions that can learn from partially observed rows,
    2. Incorporating domain‑specific constraints (e.g., physical laws for sensor data) into the adversarial loss, and
    3. Benchmarking on streaming data where the model must adapt online.

Authors

  • Zalish Mahmud
  • Anantaa Kotal
  • Aritran Piplai

Paper Information

  • arXiv ID: 2512.05950v1
  • Categories: cs.LG, cs.AI
  • Published: December 5, 2025