[Paper] bigMICE: Multiple Imputation of Big Data
Source: arXiv - 2601.21613v1
Overview
Missing data is a silent killer of insight, especially in massive health registries where even a small bias can affect policy decisions. The authors introduce bigMICE, an open‑source Spark‑based implementation of Multiple Imputation by Chained Equations (MICE) that can run on ordinary laptops while handling datasets that would normally require a cluster or huge RAM. Their experiments on a national Swedish medical registry show that bigMICE is both faster and more memory‑efficient than classic MICE tools, without sacrificing imputation quality.
Key Contributions
- Scalable MICE engine built on Apache Spark MLlib/ML that lets users set a hard memory cap.
- Memory‑controlled execution enabling imputation of multi‑gigabyte tables on low‑end hardware.
- Comprehensive benchmark on a real‑world medical registry (hundreds of thousands of rows, dozens of variables) measuring runtime, RAM usage, and imputation accuracy across different sample sizes and missingness levels.
- Practical guide for installing, configuring, and running bigMICE, complete with best‑practice recommendations.
- Open‑source release under a permissive license, encouraging community extensions and integration with existing Spark pipelines.
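The hard memory cap in the first contribution boils down to choosing a partition count so that no single partition exceeds the budget. A minimal sketch of that sizing logic (a hypothetical helper; bigMICE's actual sizing code is not shown in this summary):

```python
import math

def partitions_for_budget(dataset_bytes: int, memory_cap_bytes: int) -> int:
    """Pick a Spark partition count so no partition exceeds the memory cap.

    Hypothetical helper: the paper describes a user-set hard memory cap,
    but bigMICE's real sizing logic may differ.
    """
    if memory_cap_bytes <= 0:
        raise ValueError("memory cap must be positive")
    return max(1, math.ceil(dataset_bytes / memory_cap_bytes))

# A 16 GB table under a 2 GB cap splits into 8 partitions.
print(partitions_for_budget(16 * 2**30, 2 * 2**30))  # → 8
```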
Methodology
- Chunked Data Processing – The dataset is split into Spark partitions that respect a user‑defined memory budget. Each partition is imputed independently using the classic MICE algorithm (iterative conditional models).
- Distributed Model Fitting – For each variable with missing values, a regression (linear, logistic, or other GLM) is fit on the observed part of the column using Spark’s parallel ML estimators.
- Iterative Chaining – Imputed values for one variable are fed into the next variable’s conditional model, and the full sweep over all variables repeats for a configurable number of cycles.
- Multiple Imputation – The whole process is run m times (default = 5) to generate several complete datasets, which can later be pooled using Rubin’s rules.
- Quality Checks – The authors compare imputed values against held‑out ground truth (when available) and compute standard diagnostics (e.g., bias, coverage, RMSE) across varying missingness rates (10 %–70 %).
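The chained-equations cycle in the steps above can be illustrated with a deliberately tiny two-column sketch in plain Python. This is illustrative only: bigMICE fits its conditional models with Spark's distributed estimators, and proper MICE adds a stochastic draw rather than imputing the plain fitted mean.

```python
def ols(xs, ys):
    # Least-squares fit of y = a + b * x on observed pairs.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((v - mx) ** 2 for v in xs)
    b = sum((u - mx) * (w - my) for u, w in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b

def chained_impute(x, y, cycles=5):
    """Toy two-column MICE chain; None marks a missing cell."""
    miss_x = [v is None for v in x]
    miss_y = [v is None for v in y]

    def mean_fill(col):
        # Start every chain from the observed column mean.
        obs = [v for v in col if v is not None]
        m = sum(obs) / len(obs)
        return [m if v is None else v for v in col]

    x, y = mean_fill(x), mean_fill(y)
    for _ in range(cycles):
        # Refill x's missing cells from a regression of x on y,
        # fitted only on rows where x was actually observed...
        a, b = ols([yv for yv, m in zip(y, miss_x) if not m],
                   [xv for xv, m in zip(x, miss_x) if not m])
        x = [a + b * yv if m else xv for xv, yv, m in zip(x, y, miss_x)]
        # ...then y's missing cells from a regression of y on x.
        a, b = ols([xv for xv, m in zip(x, miss_y) if not m],
                   [yv for yv, m in zip(y, miss_y) if not m])
        y = [a + b * xv if m else yv for xv, yv, m in zip(x, y, miss_y)]
    return x, y

# With y = 2x, the missing x in the last row is recovered as 5.0.
xi, yi = chained_impute([1, 2, 3, 4, None], [2, 4, 6, 8, 10])
```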
All steps are orchestrated through a thin Python wrapper, so developers can call `bigMICE.fit_transform(df)` just like any other Spark ML transformer.
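The pooling step mentioned above is the one part simple enough to show in full: Rubin's rules combine the m completed-data analyses into a single estimate and a total variance. This is the textbook formula, not bigMICE's own code:

```python
def pool_rubin(estimates, variances):
    """Pool m completed-data estimates with Rubin's rules.

    estimates[i], variances[i]: point estimate and within-imputation
    variance from the i-th imputed dataset.
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    qbar = sum(estimates) / m                               # pooled point estimate
    ubar = sum(variances) / m                               # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    t = ubar + (1 + 1 / m) * b                              # total variance
    return qbar, t

# Five imputed datasets with slightly different estimates pool to
# a point estimate of 1.0 and a total variance above the within-
# imputation 0.04, reflecting imputation uncertainty.
q, t = pool_rubin([1.0, 1.1, 0.9, 1.05, 0.95], [0.04] * 5)
```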
Results & Findings
| Metric | bigMICE | Traditional MICE (e.g., mice in R) |
|---|---|---|
| Peak RAM (10 M rows, 30 % missing) | ~2 GB (configurable) | >12 GB (often crashes) |
| Total runtime (10 M rows) | ~12 min (8 cores) | ~45 min (single node) |
| Imputation bias (simulated ground truth) | <0.02 % across all missingness levels | <0.03 % (similar) |
| Coverage of 95 % CI | 94–96 % | 93–95 % |
Key takeaways
- Memory savings are dramatic; bigMICE stays within the user‑set limit even as the dataset grows.
- Speed gains come from parallel model fitting and avoiding data copies that plague in‑memory MICE implementations.
- Statistical quality is on par with the gold‑standard MICE, even when a single variable is 70 % missing—thanks to the large overall sample size providing enough information for the conditional models.
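A held-out-truth evaluation like the one behind the bias and coverage numbers above can be sketched as follows. This is a hypothetical helper; the authors' exact diagnostics code is not reproduced in this summary:

```python
def imputation_diagnostics(truth, imputed):
    """Bias and RMSE of imputed cells against held-out ground truth.

    Hypothetical helper mirroring the diagnostics reported above:
    mask known cells, impute them, then compare to the masked values.
    """
    n = len(truth)
    errors = [i - t for t, i in zip(truth, imputed)]
    bias = sum(errors) / n
    rmse = (sum(e * e for e in errors) / n) ** 0.5
    return bias, rmse

# Symmetric errors cancel in the bias but not in the RMSE.
bias, rmse = imputation_diagnostics([1.0, 2.0, 3.0], [1.1, 1.9, 3.0])
```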
Practical Implications
- Data engineers can now embed robust multiple imputation directly into ETL pipelines that already run on Spark, eliminating the need for a separate preprocessing step on a high‑memory workstation.
- Machine‑learning teams gain cleaner training data without sacrificing scalability, which can improve model generalization in domains like predictive health analytics, fraud detection, or IoT telemetry.
- Healthcare analysts can run reproducible imputations on national registries from a laptop, democratizing access to high‑quality statistical methods and speeding up policy‑impact studies.
- Cost reduction: organizations can avoid provisioning large RAM instances or dedicated Spark clusters solely for imputation, freeing resources for downstream modeling or inference workloads.
Limitations & Future Work
- Model flexibility: bigMICE currently supports linear, logistic, and Poisson GLMs; extending to more exotic learners (e.g., gradient‑boosted trees) would broaden applicability.
- Convergence diagnostics are limited to basic iteration counts; richer diagnostics (trace plots, Gelman‑Rubin statistics) are planned for a future release.
- Distributed storage assumptions: the implementation assumes data fits into Spark’s DataFrame abstraction; extremely sparse or highly categorical datasets may still hit performance bottlenecks.
- Future research will explore adaptive chunk sizing, integration with Spark’s Structured Streaming for online imputation, and benchmarking on non‑healthcare big data (e.g., e‑commerce clickstreams).
bigMICE bridges the gap between rigorous statistical imputation and modern big‑data processing frameworks, making multiple imputation a first‑class citizen in production data pipelines. If you’re already running Spark jobs, give the package a spin and see how much cleaner your data can become without breaking the budget.
Authors
- Hugo Morvan
- Jonas Agholme
- Bjorn Eliasson
- Katarina Olofsson
- Ludger Grote
- Fredrik Iredahl
- Oleg Sysoev
Paper Information
- arXiv ID: 2601.21613v1
- Categories: stat.CO, cs.DC
- Published: January 29, 2026