Advanced Imputation with R Packages
Source: Dev.to
What Are Missing Values?
Imagine you are collecting survey data where participants fill out personal details. For someone who is married, the marital status will be married and they may provide the names of their spouse and children. For unmarried respondents, these fields will naturally be left blank.
This is a genuine example of missing values, but missing data can also occur due to human error (forgetting to enter data), incorrect entries (like a negative age), or system errors during data collection.
Before handling missing data, it’s important to identify which type of missingness you are dealing with.
Types of Missing Values
Missing data is typically classified into three categories:
MCAR (Missing Completely At Random)
Missing values occur randomly with no relationship to any other variable.
Example: A survey participant accidentally skips a question. MCAR is rare but easiest to handle because the missingness does not introduce bias.
MAR (Missing At Random)
Missing values depend on other observed variables but not on the missing variable itself.
Example: Males are less likely to answer a survey question on mental health. The missingness can be predicted from an observed variable (gender), even though the missing answers themselves are not observed. MAR values can often be safely imputed.
NMAR (Not Missing At Random)
Missing values are related to the value itself or hidden factors.
Example: Missing spouse names could indicate unmarried participants or deliberate omission. NMAR requires careful handling, as ignoring these values may bias the analysis.
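The three mechanisms can be made concrete with a small simulation. This is a minimal sketch in base R: the survey data, the variable names (`gender`, `score`), and the missingness probabilities are all made up for illustration and do not come from a real survey.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
n <- 1000

# Hypothetical survey: gender is always observed, score is the sensitive answer
survey <- data.frame(
  gender = sample(c("male", "female"), n, replace = TRUE),
  score  = rnorm(n, mean = 50, sd = 10)
)

# MCAR: every answer has the same 20% chance of being missing
mcar <- survey
mcar$score[runif(n) < 0.2] <- NA

# MAR: missingness depends only on the observed variable gender
mar <- survey
p_mar <- ifelse(survey$gender == "male", 0.4, 0.1)
mar$score[runif(n) < p_mar] <- NA

# NMAR: missingness depends on the (unobserved) score itself
nmar <- survey
p_nmar <- ifelse(survey$score > 60, 0.6, 0.05)
nmar$score[runif(n) < p_nmar] <- NA

# Compare the mean of the observed scores with the true mean
c(true = mean(survey$score),
  mcar = mean(mcar$score, na.rm = TRUE),
  mar  = mean(mar$score,  na.rm = TRUE),
  nmar = mean(nmar$score, na.rm = TRUE))

# Under MAR, the missing rate differs by the observed variable
tapply(is.na(mar$score), mar$gender, mean)
```

In this setup the NMAR mean is pulled downward because high scores are removed preferentially, which is the kind of bias that cannot be detected from the observed data alone.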
Strategies to Handle Missing Values
Dropping Missing Values
If the proportion of missing data is very small (e.g., < 5 %), you may choose to drop the rows that contain missing values:
clean_data <- na.omit(dataset)
However, dropping too many rows may lead to loss of valuable information, especially when missingness follows a pattern.
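Before dropping rows, it helps to measure how much data would actually be lost. The sketch below uses a small made-up data frame standing in for `dataset`; the column names and values are assumptions for illustration only.

```r
# Hypothetical data frame standing in for `dataset`
dataset <- data.frame(
  age    = c(34, NA, 29, 41, 52, 38),
  income = c(52000, 61000, NA, 48000, 75000, 58000),
  gender = c("F", "M", "M", NA, "F", "F")
)

# Share of missing values per column
colMeans(is.na(dataset))

# Share of rows that listwise deletion would remove
mean(!complete.cases(dataset))

clean_data <- na.omit(dataset)
nrow(clean_data)
```

Here each column is missing only one value, yet half of the rows are incomplete, so the per-column rate can understate what `na.omit()` removes.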
Imputing Missing Values
Imputation involves filling missing values with plausible substitutes, preserving the structure and distribution of the data.
- Numeric Data – Use mean, median, or moving averages.
- Categorical Data – Use mode (most frequent value) or prediction‑based imputation.
- Special Cases – Use placeholder values like -1 for age or "Unknown" for categorical fields (useful for quick exploration).
Examples
# Mean imputation for numeric variable
dataset$age[is.na(dataset$age)] <- mean(dataset$age, na.rm = TRUE)
# Mode imputation for categorical variable
# (base R's mode() returns the storage type, so take the most frequent level from table())
dataset$gender[is.na(dataset$gender)] <- names(which.max(table(dataset$gender)))
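Base R has no built-in function for the statistical mode (its `mode()` reports an object's storage type), so a small helper can be convenient. The sketch below defines a hypothetical `stat_mode()` helper, applies median and mode imputation to made-up data, and checks how simple imputation changes the variance of the numeric column.

```r
# Hypothetical helper: most frequent non-missing value, returned as character
stat_mode <- function(x) {
  counts <- table(x)                 # table() drops NA by default
  names(counts)[which.max(counts)]
}

# Made-up data for illustration
dataset <- data.frame(
  age    = c(23, 35, NA, 41, 29, NA, 35, 62),
  gender = c("F", "M", "F", NA, "F", "M", NA, "F")
)

var_before <- var(dataset$age, na.rm = TRUE)

# Median imputation is less sensitive to outliers than the mean
dataset$age[is.na(dataset$age)] <- median(dataset$age, na.rm = TRUE)
dataset$gender[is.na(dataset$gender)] <- stat_mode(dataset$gender)

var_after <- var(dataset$age)
c(before = var_before, after = var_after)
```

Filling every gap with the same central value shrinks the variance, which is one reason single-value imputation is best kept for quick exploration.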
Advanced Imputation with R Packages
R provides several powerful packages for robust imputation, including:
- Hmisc – General‑purpose imputation.
- missForest – Non‑parametric imputation using Random Forest.
- Amelia – Multiple imputation for time‑series and cross‑sectional data.
- mice – Multivariate Imputation via Chained Equations (gold standard for MAR data).
We’ll focus on the mice package, which is widely used for MAR missing values and provides multiple imputed datasets for robust modeling.
Using the mice Package
Step 1: Load Packages and Data
library(mice)
library(VIM)
library(lattice)
data(nhanes) # NHANES dataset
# NHANES contains 25 observations and 4 variables: age, bmi, hyp (hypertension), and chl (cholesterol).
# Several variables have missing values: bmi, hyp, and chl.
# Age is coded in bands (1, 2, 3) and better treated as a factor:
nhanes$age <- as.factor(nhanes$age)
Step 2: Understand Missing Patterns
md.pattern(nhanes)
The function shows the pattern of missingness, including which variables are missing together.
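The same information can be read off numerically. The sketch below assumes the mice package is installed (it ships the `nhanes` data) and uses only base R counts plus `md.pattern()` with its plot switched off.

```r
library(mice)  # provides the nhanes data set

# Number of missing values per variable
colSums(is.na(nhanes))

# Number of fully observed rows
sum(complete.cases(nhanes))

# Pattern table only, without drawing the plot
md.pattern(nhanes, plot = FALSE)
```

In the pattern table, each row is one missingness pattern (1 = observed, 0 = missing), the row name is the number of rows with that pattern, the last column counts the missing variables in the pattern, and the bottom row counts the missing values per variable.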
Visualizing Missing Data
nhanes_miss <- aggr(
  nhanes,
  col = mdc(1:2),
  numbers = TRUE,
  sortVars = TRUE,
  labels = names(nhanes),
  cex.axis = .7,
  gap = 3,
  ylab = c("Proportion of missingness", "Missingness Pattern")
)
marginplot(
  nhanes[, c("chl", "bmi")],
  col = mdc(1:2),
  cex.numbers = 1.2,
  pch = 19
)
aggr() shows the proportion of missingness per variable. marginplot() displays the distribution of one variable split by whether the other variable is observed or missing; similar distributions in both groups are consistent with MCAR.
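A rough numeric counterpart to the margin plot is to compare one variable between rows where another variable is missing and rows where it is observed. This is only a sketch, and with 25 rows the groups are too small for firm conclusions.

```r
library(mice)  # provides the nhanes data set

# Flag rows where chl is missing
chl_missing <- is.na(nhanes$chl)
table(chl_missing)

# Mean of the observed bmi values in each group
tapply(nhanes$bmi, chl_missing, mean, na.rm = TRUE)
```

A large gap between the two group means would hint that missingness in chl is related to bmi, which argues against MCAR.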
Step 3: Impute Missing Values
mice_imputes <- mice(nhanes, m = 5, maxit = 40)
# m = 5 → creates 5 imputed datasets
# maxit = 40 → number of iterations of the chained-equations algorithm
# Default method: Predictive Mean Matching (PMM) for numeric variables
mice_imputes$method
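Because mice draws imputations at random, results change between runs unless a seed is set. The sketch below is one way to make the run reproducible and to inspect what was imputed. The object name `imp`, the seed value, and the shorter `maxit = 10` are arbitrary choices for illustration.

```r
library(mice)

# seed fixes the random draws; printFlag = FALSE silences the iteration log
imp <- mice(nhanes, m = 5, maxit = 10, seed = 123, printFlag = FALSE)

imp$method          # imputation method chosen for each variable
head(imp$imp$bmi)   # imputed bmi values: one row per missing cell, one column per data set

# Trace plots of the mean and SD of the imputed values per iteration;
# lines that intermingle without a trend suggest the algorithm has converged
plot(imp)
```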
Step 4: Extract an Imputed Dataset
imputed_data <- complete(mice_imputes, 5) # Using the 5th imputed dataset
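Picking a single completed data set discards the variation between imputations. As a sketch, all completed data sets can be stacked in long format to see how much they differ. The object names `imp` and `long` and the seed are assumptions for illustration.

```r
library(mice)

imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Stack all five completed data sets; .imp identifies the imputation, .id the row
long <- mice::complete(imp, action = "long")
table(long$.imp)

# The spread of the mean bmi across imputations reflects imputation uncertainty
tapply(long$bmi, long$.imp, mean)
```

The `mice::` prefix guards against other packages that also export a function named `complete()`.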
Step 5: Evaluate Imputation Quality
XY Plot
xyplot(mice_imputes, bmi ~ chl | .imp, pch = 20, cex = 1.4)
- Blue points = observed data
- Red points = imputed data
A plausible imputation shows red points that follow the same shape and range as the blue ones.
Density Plot
densityplot(mice_imputes)
Compares the distribution of observed and imputed values.
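The same comparison can be made numerically. The sketch below assumes the default PMM method and compares quantiles of observed and imputed bmi values. The object names and the seed are arbitrary.

```r
library(mice)

imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

observed <- nhanes$bmi[!is.na(nhanes$bmi)]
imputed  <- unlist(imp$imp$bmi)   # all imputed bmi values across the 5 data sets

# Side-by-side quantiles of observed and imputed values
rbind(observed = quantile(observed), imputed = quantile(imputed))

# PMM borrows values from observed donors, so every imputed value
# should also appear among the observed ones
all(imputed %in% observed)
```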
Step 6: Modeling with Multiple Imputed Datasets
lm_5_model <- with(mice_imputes, lm(chl ~ age + bmi + hyp))
combo_5_model <- pool(lm_5_model)
summary(combo_5_model)
with() fits the model on each imputed dataset, and pool() combines the results, providing more reliable estimates than using a single imputed dataset.
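To show what pooling does, the sketch below applies Rubin's rules by hand to a single coefficient. The estimates and standard errors are made-up numbers, not output from the model above. `pool()` performs the same combination and additionally computes degrees of freedom.

```r
# Hypothetical estimates of one coefficient from m = 5 imputed data sets
est <- c(2.10, 1.95, 2.30, 2.05, 2.20)   # point estimates
se  <- c(0.50, 0.48, 0.55, 0.52, 0.49)   # their standard errors
m   <- length(est)

pooled_est <- mean(est)       # pooled point estimate
within     <- mean(se^2)      # average within-imputation variance
between    <- var(est)        # between-imputation variance
total_var  <- within + (1 + 1 / m) * between

c(estimate = pooled_est, std.error = sqrt(total_var))
```

The pooled standard error is larger than the average within-imputation one because it also accounts for the disagreement between imputations, which a single imputed data set cannot capture.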
Summary
- Missing values are a common challenge in data analysis.
- Depending on the type and amount of missingness, they can be ignored, dropped, or imputed.
- R packages such as mice, Hmisc, Amelia, and missForest offer advanced imputation methods.
- The mice package is particularly powerful for MAR missing values, allowing multiple imputations and robust modeling.
- Visualizing missingness with VIM helps determine the nature of missing values and guides proper imputation.
Proper handling of missing data reduces bias and makes models more reliable, which is a cornerstone of successful data-science projects.