[Paper] Beyond Accuracy: An Empirical Study of Uncertainty Estimation in Imputation

Published: November 26, 2025 at 12:27 PM EST
4 min read
Source: arXiv - 2511.21607v1

Overview

Missing values are a fact of life in real‑world datasets, and the way we fill them (imputation) can dramatically affect downstream analytics. Recent imputation techniques focus on reconstruction accuracy, yet many also claim to provide uncertainty estimates: how confident the model is about each imputed entry. This paper delivers the first large‑scale, systematic comparison of those uncertainty estimates across statistical, optimal‑transport, and deep‑generative families, revealing that high accuracy does not guarantee trustworthy uncertainty.

Key Contributions

  • Comprehensive benchmark covering 6 representative imputers (MICE, SoftImpute, OT‑Impute, GAIN, MIWAE, TabCSDI) on dozens of public tabular datasets.
  • Three uncertainty‑estimation pipelines evaluated side‑by‑side:
    1. Variability across multiple runs,
    2. Conditional sampling from the model, and
    3. Explicit predictive‑distribution modeling.
  • Calibration‑focused evaluation using reliability diagrams and Expected Calibration Error (ECE), a metric more common in classification that the authors adapt to continuous imputation (one interval‑based adaptation is sketched after this list).
  • Empirical insight that reconstruction error and calibration are often orthogonal; the best‑looking imputer can be the worst at quantifying its own doubt.
  • Practical guidance on selecting an imputer based on the trade‑off between accuracy, calibration, and runtime, plus “stable configurations” that work well across missingness mechanisms (MCAR, MAR, MNAR).
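
One way to make the adapted ECE concrete (an interval‑based reading assumed here, not a definition quoted from the paper): pick a grid of nominal confidence levels, measure how often the ground truth actually falls inside each predicted interval, and average the gaps.

```latex
% Interval-based ECE for continuous imputation (assumed adaptation):
% L is a grid of nominal confidence levels; \widehat{\mathrm{cov}}(\alpha) is the
% fraction of held-out ground-truth values inside the level-alpha predicted interval.
\mathrm{ECE} = \frac{1}{|L|} \sum_{\alpha \in L}
    \left| \alpha - \widehat{\mathrm{cov}}(\alpha) \right|
```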

Methodology

  1. Datasets & Missingness – The authors sampled a diverse set of tabular benchmarks (e.g., UCI, healthcare, finance) and artificially introduced missing values under three canonical mechanisms:

    • MCAR (Missing Completely at Random)
    • MAR (Missing at Random)
    • MNAR (Missing Not at Random)
    Missingness rates ranged from 10% to 50% across all mechanisms (a minimal injection sketch appears after this list).
  2. Imputation Families

    • Statistical: Multiple Imputation by Chained Equations (MICE) and SoftImpute (matrix‑completion).
    • Distribution Alignment: OT‑Impute, which aligns observed and latent distributions via optimal transport.
    • Deep Generative: GAIN (GAN‑based), MIWAE (variational auto‑encoder with importance weighting), and TabCSDI (conditional diffusion).
  3. Uncertainty Estimation

    • Multi‑run variability: Train the same model several times with different random seeds; the spread of imputations serves as an uncertainty proxy.
    • Conditional sampling: Draw multiple samples from the model’s conditional distribution given observed entries (e.g., multiple GAN or diffusion draws).
    • Predictive‑distribution modeling: Directly use the model’s learned posterior variance (e.g., VAE’s Gaussian decoder variance).
  4. Evaluation

    • Calibration curves: Plot predicted confidence intervals against empirical coverage.
    • Expected Calibration Error (ECE): Quantify the average gap between predicted and observed confidence (see the coverage‑based sketch after this list).
    • Reconstruction error: Root‑mean‑square error (RMSE) on held‑out ground truth.
    • Runtime: Wall‑clock time on a single GPU/CPU configuration.
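
To make the missingness step concrete, here is a minimal sketch of injecting MCAR and MAR missingness into a numeric pandas DataFrame. The function names and the rank‑based MAR rule are illustrative assumptions, not the authors' exact procedure.

```python
# Illustrative only: mask a complete numeric table under MCAR or MAR at a
# target rate (the paper uses 10%-50%). Names and the MAR rule are assumptions.
import numpy as np
import pandas as pd

def inject_mcar(df: pd.DataFrame, rate: float, seed: int = 0) -> pd.DataFrame:
    """MCAR: every cell is masked independently with probability `rate`."""
    rng = np.random.default_rng(seed)
    return df.mask(rng.random(df.shape) < rate)

def inject_mar(df: pd.DataFrame, target: str, driver: str,
               rate: float, seed: int = 0) -> pd.DataFrame:
    """MAR: mask `target` with a probability that depends only on the
    fully observed column `driver` (higher values -> more missingness)."""
    rng = np.random.default_rng(seed)
    p = np.clip(2 * rate * df[driver].rank(pct=True).to_numpy(), 0, 1)
    out = df.copy()
    out.loc[rng.random(len(df)) < p, target] = np.nan
    return out
```

MNAR is not sketched here because it requires the masking probability to depend on the (unobserved) value being removed, which needs access to the ground truth before masking.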
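
And a sketch of the uncertainty‑evaluation step under one plausible reading of the setup: stack K imputations of the same missing cells (from multi‑run variability, conditional samples, or predictive‑distribution draws), form central intervals from the empirical quantiles, and report the average gap between nominal and empirical coverage as an ECE‑style score. The function and the level grid are assumptions.

```python
# Illustrative calibration check: nominal vs. empirical interval coverage.
import numpy as np

def coverage_ece(imputations: np.ndarray, truth: np.ndarray,
                 levels=(0.5, 0.7, 0.9, 0.95)) -> float:
    """imputations: (K, n) array, K imputed values per missing cell
    (multi-run variability, conditional samples, or predictive draws).
    truth: (n,) held-out ground-truth values for the same cells."""
    gaps = []
    for level in levels:
        lo = np.quantile(imputations, 0.5 - level / 2, axis=0)  # lower bound
        hi = np.quantile(imputations, 0.5 + level / 2, axis=0)  # upper bound
        empirical = np.mean((truth >= lo) & (truth <= hi))      # coverage
        gaps.append(abs(level - empirical))                     # reliability gap
    return float(np.mean(gaps))                                 # ECE-style score
```

Plotting the nominal levels against the empirical coverages gives the reliability diagrams mentioned above.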

Results & Findings

Table: qualitative star ratings comparing the six imputers (MICE, SoftImpute, OT‑Impute, GAIN, MIWAE, TabCSDI) on RMSE (lower = better), ECE (lower = better), and typical runtime.

  • Accuracy vs. Calibration: MIWAE and TabCSDI achieve the best calibration (lowest ECE) but are not always the most accurate in RMSE. Conversely, GAIN often yields low RMSE but poor calibration.
  • Missingness Mechanism Matters: Under MNAR, OT‑Impute’s transport‑based alignment retains relatively stable calibration, while statistical methods degrade sharply.
  • Uncertainty Estimation Route: Conditional sampling consistently outperforms multi‑run variability for deep generative models, whereas predictive‑distribution modeling works best for VAEs (MIWAE).
  • Runtime Trade‑off: Simple statistical methods are fast but provide weak uncertainty signals; diffusion‑based TabCSDI offers strong uncertainty at a high computational cost.

Practical Implications

  • Data‑Cleaning Pipelines: When downstream models are sensitive to imputation error (e.g., risk scoring), prioritize calibrated uncertainty (MIWAE or TabCSDI) to flag dubious entries for manual review (a flagging sketch follows this list).
  • Active Learning & Experiment Design: Use calibrated uncertainty to guide selective data collection—focus on features with high imputation variance to reduce overall model risk.
  • Model‑Based Decision Systems: In regulated domains (finance, healthcare), reporting calibrated confidence intervals can satisfy compliance requirements that plain point estimates cannot.
  • Resource Allocation: For large‑scale batch jobs where latency matters, OT‑Impute offers a sweet spot—reasonable accuracy, decent calibration, and moderate runtime.
  • Tooling: The authors release an open‑source benchmark suite that plugs into popular Python data stacks (pandas, scikit‑learn, PyTorch), making it easy for engineers to swap imputers and automatically obtain calibration diagnostics.
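
As a concrete illustration of the "flag dubious entries" workflow, the sketch below uses scikit‑learn's IterativeImputer (a MICE‑style imputer standing in for whatever model the pipeline actually uses) and marks the imputed cells whose values vary most across stochastic runs. The function name and the 5% threshold are assumptions.

```python
# Illustrative review-flagging sketch: high-variance imputed cells get flagged.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def flag_uncertain_cells(df: pd.DataFrame, n_runs: int = 10,
                         top_frac: float = 0.05) -> pd.DataFrame:
    """Boolean mask marking the `top_frac` most uncertain imputed cells."""
    missing = df.isna().to_numpy()
    draws = np.stack([
        IterativeImputer(sample_posterior=True,
                         random_state=seed).fit_transform(df)
        for seed in range(n_runs)
    ])                                          # shape: (n_runs, rows, cols)
    spread = draws.std(axis=0)                  # per-cell disagreement
    cutoff = np.quantile(spread[missing], 1.0 - top_frac)
    flags = (spread >= cutoff) & missing        # only originally missing cells
    return pd.DataFrame(flags, index=df.index, columns=df.columns)
```

In a risk‑scoring pipeline, the returned mask can drive a manual‑review queue or trigger re‑collection of the flagged fields.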

Limitations & Future Work

  • Synthetic Missingness: All experiments rely on artificially induced missingness; real‑world MNAR patterns may be more complex.
  • Calibration Metric Scope: ECE, while informative, aggregates over all variables; per‑feature calibration could reveal hidden biases.
  • Scalability: Diffusion‑based TabCSDI struggles on >1 M rows; future work could explore hierarchical or streaming variants.
  • Beyond Tabular: Extending the study to mixed‑type (text + numeric) or time‑series data remains an open challenge.