[Paper] bigMICE: Multiple Imputation of Big Data
Source: arXiv - 2601.21613v1
Overview
Missing data is a silent killer of insight, especially in massive health registries where even a small bias can affect policy decisions. The authors introduce bigMICE, an open‑source Spark‑based implementation of Multiple Imputation by Chained Equations (MICE) that can run on ordinary laptops while handling datasets that would normally require a cluster or huge RAM. Their experiments on a national Swedish medical registry show that bigMICE is both faster and more memory‑efficient than classic MICE tools, without sacrificing imputation quality.
Key Contributions
- Scalable MICE engine built on Apache Spark MLlib/ML that lets users set a hard memory cap.
- Memory‑controlled execution enabling imputation of multi‑gigabyte tables on low‑end hardware.
- Comprehensive benchmark on a real‑world medical registry (hundreds of thousands of rows, dozens of variables) measuring runtime, RAM usage, and imputation accuracy across different sample sizes and missingness levels.
- Practical guide for installing, configuring, and running bigMICE, complete with best‑practice recommendations.
- Open‑source release under a permissive license, encouraging community extensions and integration with existing Spark pipelines.
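The hard memory cap in the first contribution boils down to choosing a partition count so that no single partition exceeds the budget. A minimal sketch of that sizing logic (a hypothetical helper; bigMICE's actual sizing code is not shown in this summary):

```python
import math

def partitions_for_budget(dataset_bytes: int, memory_cap_bytes: int) -> int:
    """Pick a Spark partition count so no partition exceeds the memory cap.

    Hypothetical helper: the paper describes a user-set hard memory cap,
    but bigMICE's real sizing logic may differ.
    """
    if memory_cap_bytes <= 0:
        raise ValueError("memory cap must be positive")
    return max(1, math.ceil(dataset_bytes / memory_cap_bytes))

# A 16 GB table under a 2 GB cap splits into 8 partitions.
print(partitions_for_budget(16 * 2**30, 2 * 2**30))  # → 8
```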
Methodology
- Chunked Data Processing – The dataset is split into Spark partitions that respect a user‑defined memory budget. Each partition is imputed independently using the classic MICE algorithm (iterative conditional models).
- Distributed Model Fitting – For each variable with missing values, a regression (linear, logistic, or other GLM) is fit on the observed part of the column using Spark’s parallel ML estimators.
- Iterative Chaining – Imputed values for one variable are fed into the next variable’s conditional model, and the full sweep over all variables repeats for a configurable number of cycles.
- Multiple Imputation – The whole process is run m times (default = 5) to generate several complete datasets, which can later be pooled using Rubin’s rules.
- Quality Checks – The authors compare imputed values against held‑out ground truth (when available) and compute standard diagnostics (e.g., bias, coverage, RMSE) across varying missingness rates (10 %–70 %).
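The chained-equations cycle in the steps above can be illustrated with a deliberately tiny two-column sketch in plain Python. This is illustrative only: bigMICE fits its conditional models with Spark's distributed estimators, and proper MICE adds a stochastic draw rather than imputing the plain fitted mean.

```python
def ols(xs, ys):
    # Least-squares fit of y = a + b * x on observed pairs.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((v - mx) ** 2 for v in xs)
    b = sum((u - mx) * (w - my) for u, w in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b

def chained_impute(x, y, cycles=5):
    """Toy two-column MICE chain; None marks a missing cell."""
    miss_x = [v is None for v in x]
    miss_y = [v is None for v in y]

    def mean_fill(col):
        # Start every chain from the observed column mean.
        obs = [v for v in col if v is not None]
        m = sum(obs) / len(obs)
        return [m if v is None else v for v in col]

    x, y = mean_fill(x), mean_fill(y)
    for _ in range(cycles):
        # Refill x's missing cells from a regression of x on y,
        # fitted only on rows where x was actually observed...
        a, b = ols([yv for yv, m in zip(y, miss_x) if not m],
                   [xv for xv, m in zip(x, miss_x) if not m])
        x = [a + b * yv if m else xv for xv, yv, m in zip(x, y, miss_x)]
        # ...then y's missing cells from a regression of y on x.
        a, b = ols([xv for xv, m in zip(x, miss_y) if not m],
                   [yv for yv, m in zip(y, miss_y) if not m])
        y = [a + b * xv if m else yv for xv, yv, m in zip(x, y, miss_y)]
    return x, y

# With y = 2x, the missing x in the last row is recovered as 5.0.
xi, yi = chained_impute([1, 2, 3, 4, None], [2, 4, 6, 8, 10])
```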
All steps are orchestrated through a thin Python wrapper, so developers can call `bigMICE.fit_transform(df)` just like any other Spark ML transformer.
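The pooling step mentioned above is the one part simple enough to show in full: Rubin's rules combine the m completed-data analyses into a single estimate and a total variance. This is the textbook formula, not bigMICE's own code:

```python
def pool_rubin(estimates, variances):
    """Pool m completed-data estimates with Rubin's rules.

    estimates[i], variances[i]: point estimate and within-imputation
    variance from the i-th imputed dataset.
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    qbar = sum(estimates) / m                               # pooled point estimate
    ubar = sum(variances) / m                               # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    t = ubar + (1 + 1 / m) * b                              # total variance
    return qbar, t

# Five imputed datasets with slightly different estimates pool to
# a point estimate of 1.0 and a total variance above the within-
# imputation 0.04, reflecting imputation uncertainty.
q, t = pool_rubin([1.0, 1.1, 0.9, 1.05, 0.95], [0.04] * 5)
```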
Results & Findings
| Metric | bigMICE | Traditional MICE (e.g., mice in R) |
|---|---|---|
| Peak RAM (10 M rows, 30 % missing) | ~2 GB (configurable) | >12 GB (often crashes) |
| Total runtime (10 M rows) | ~12 min (8 cores) | ~45 min (single node) |
| Imputation bias (simulated ground truth) | <0.02 % across all missingness levels | <0.03 % (similar) |
| Coverage of 95 % CI | 94–96 % | 93–95 % |
Key takeaways
- Memory savings are dramatic; bigMICE stays within the user‑set limit even as the dataset grows.
- Speed gains come from parallel model fitting and avoiding data copies that plague in‑memory MICE implementations.
- Statistical quality is on par with the gold‑standard MICE, even when a single variable is 70 % missing—thanks to the large overall sample size providing enough information for the conditional models.
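A held-out-truth evaluation like the one behind the bias and coverage numbers above can be sketched as follows. This is a hypothetical helper; the authors' exact diagnostics code is not reproduced in this summary:

```python
def imputation_diagnostics(truth, imputed):
    """Bias and RMSE of imputed cells against held-out ground truth.

    Hypothetical helper mirroring the diagnostics reported above:
    mask known cells, impute them, then compare to the masked values.
    """
    n = len(truth)
    errors = [i - t for t, i in zip(truth, imputed)]
    bias = sum(errors) / n
    rmse = (sum(e * e for e in errors) / n) ** 0.5
    return bias, rmse

# Symmetric errors cancel in the bias but not in the RMSE.
bias, rmse = imputation_diagnostics([1.0, 2.0, 3.0], [1.1, 1.9, 3.0])
```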
Practical Implications
- Data engineers can now embed robust multiple imputation directly into ETL pipelines that already run on Spark, eliminating the need for a separate preprocessing step on a high‑memory workstation.
- Machine‑learning teams gain cleaner training data without sacrificing scalability, which can improve model generalization in domains like predictive health analytics, fraud detection, or IoT telemetry.
- Healthcare analysts can run reproducible imputations on national registries from a laptop, democratizing access to high‑quality statistical methods and speeding up policy‑impact studies.
- Cost reduction: organizations can avoid provisioning large RAM instances or dedicated Spark clusters solely for imputation, freeing resources for downstream modeling or inference workloads.
Limitations & Future Work
- Model flexibility: bigMICE currently supports linear, logistic, and Poisson GLMs; extending to more exotic learners (e.g., gradient‑boosted trees) would broaden applicability.
- Convergence diagnostics are limited to basic iteration counts; richer diagnostics (trace plots, Gelman‑Rubin statistics) are planned for a future release.
- Distributed storage assumptions: the implementation assumes data fits into Spark’s DataFrame abstraction; extremely sparse or highly categorical datasets may still hit performance bottlenecks.
- Future research will explore adaptive chunk sizing, integration with Spark’s Structured Streaming for online imputation, and benchmarking on non‑healthcare big data (e.g., e‑commerce clickstreams).
bigMICE bridges the gap between rigorous statistical imputation and modern big‑data processing frameworks, making multiple imputation a first‑class citizen in production data pipelines. If you’re already running Spark jobs, give the package a spin and see how much cleaner your data can become without breaking the budget.
Authors
- Hugo Morvan
- Jonas Agholme
- Bjorn Eliasson
- Katarina Olofsson
- Ludger Grote
- Fredrik Iredahl
- Oleg Sysoev
Paper Information
- arXiv ID: 2601.21613v1
- Categories: stat.CO, cs.DC
- Published: January 29, 2026