Advanced Imputation with R Packages

Published: December 10, 2025 at 02:42 PM EST

Source: Dev.to

What Are Missing Values?

Imagine you are collecting survey data where participants fill out personal details. For someone who is married, the marital status will be married and they may provide the names of their spouse and children. For unmarried respondents, these fields will naturally be left blank.

Here the blanks are legitimate: the fields simply do not apply. Missing data can also result from human error (forgetting to enter data), incorrect entries that have to be discarded (like a negative age), or system errors during data collection.

Before handling missing data, it’s important to identify which type of missingness you are dealing with.

Types of Missing Values

Missing data is typically classified into three categories:

MCAR (Missing Completely At Random)

Missingness is unrelated to any variable, observed or unobserved.
Example: A survey participant accidentally skips a question. MCAR is rare but the easiest to handle, because dropping the affected rows costs sample size but does not introduce bias.

MAR (Missing At Random)

Missingness depends on other observed variables but not on the missing value itself.
Example: Males are less likely to answer a survey question on mental health. The missingness can be predicted from gender, which is observed, and within each gender it does not depend on the unanswered value. MAR values can often be imputed safely by conditioning on those observed variables.

NMAR (Not Missing At Random)

Missingness depends on the unobserved value itself or on factors that were not recorded.
Example: A missing spouse name could mean the participant is unmarried or that the name was deliberately withheld, and the data alone cannot tell which. NMAR requires careful handling, because ignoring or naively imputing these values can bias the analysis.
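
To make the three mechanisms concrete, here is a small simulation sketch in base R. The variables (age, income), the 30 % missingness rate, and the logistic missingness rules are all assumptions made up for illustration. They are not taken from any real survey.

```r
set.seed(42)
n <- 1000

# Hypothetical data: income rises with age.
age <- rnorm(n, mean = 40, sd = 10)
income <- 20 + 0.8 * age + rnorm(n, sd = 5)

# MCAR: every value has the same 30% chance of being missing.
income_mcar <- income
income_mcar[runif(n) < 0.3] <- NA

# MAR: the chance of being missing depends only on age, which is observed.
income_mar <- income
income_mar[runif(n) < plogis((age - 40) / 5)] <- NA

# NMAR: the chance of being missing depends on the income value itself.
income_nmar <- income
income_nmar[runif(n) < plogis((income - mean(income)) / 5)] <- NA

# Compare the mean of the values that remain with the full-data mean.
round(c(
  full = mean(income),
  mcar = mean(income_mcar, na.rm = TRUE),
  mar  = mean(income_mar, na.rm = TRUE),
  nmar = mean(income_nmar, na.rm = TRUE)
), 2)
```

In this setup the MCAR mean stays close to the full-data mean, while the MAR and NMAR means are pulled downward because higher incomes go missing more often. The difference is that the MAR case can be corrected using age, whereas the NMAR case cannot be corrected from the observed data alone.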

Strategies to Handle Missing Values

Dropping Missing Values

If the proportion of missing data is very small (a common rule of thumb is under 5 %), you may choose to drop the rows that contain missing values:

clean_data <- na.omit(dataset)

However, dropping too many rows may lead to loss of valuable information, especially when missingness follows a pattern.
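
Before dropping anything, it helps to measure how much data is actually missing, per column and per row. The sketch below uses a small made-up data frame (the name survey and its values are hypothetical) and only base R functions.

```r
# Hypothetical survey data with a few missing entries.
survey <- data.frame(
  age    = c(25, 31, NA, 47, 52, 38, 29, NA, 44, 36),
  income = c(40, NA, 55, 62, 70, 48, NA, 51, 66, 45),
  gender = c("F", "M", "M", NA, "F", "F", "M", "F", "M", "F")
)

# Share of missing values in each column.
missing_share <- colMeans(is.na(survey))
print(missing_share)

# Share of rows with no missing value at all,
# i.e. the rows that would survive na.omit().
complete_share <- mean(complete.cases(survey))
print(complete_share)

clean_survey <- na.omit(survey)
nrow(clean_survey)
```

No column here is more than 20 % missing, yet only half of the rows are complete, because the gaps fall in different rows. This is why the per-column rate alone can understate the cost of dropping rows.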

Imputing Missing Values

Imputation fills missing values with plausible substitutes so that the affected rows can be kept. How well the structure and distribution of the data are preserved depends on the method chosen.

Numeric Data – Use mean, median, or moving averages.
Categorical Data – Use mode (most frequent value) or prediction‑based imputation.
Special Cases – Use placeholder values like -1 for age or "Unknown" for categorical fields (useful for quick exploration).

Examples

# Mean imputation for numeric variable
dataset$age[is.na(dataset$age)] <- mean(dataset$age, na.rm = TRUE)

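# Mode imputation for categorical variable
# (base R has no statistical-mode function; mode() returns the storage type)
mode_value <- names(which.max(table(dataset$gender)))
dataset$gender[is.na(dataset$gender)] <- mode_value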
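
One side effect of mean imputation is worth seeing in numbers: it keeps the mean but shrinks the spread, since every filled value sits exactly at the centre. The following sketch uses simulated data. The sample size, the distribution, and the number of removed values are arbitrary assumptions.

```r
set.seed(7)

# Hypothetical numeric variable with 60 of 200 values removed at random.
x <- rnorm(200, mean = 50, sd = 10)
x_miss <- x
x_miss[sample(200, 60)] <- NA

# Mean imputation: replace every NA with the mean of the observed values.
x_mean_imp <- x_miss
x_mean_imp[is.na(x_mean_imp)] <- mean(x_miss, na.rm = TRUE)

# The mean is unchanged, but the standard deviation shrinks.
c(
  mean_observed = mean(x_miss, na.rm = TRUE),
  mean_imputed  = mean(x_mean_imp),
  sd_observed   = sd(x_miss, na.rm = TRUE),
  sd_imputed    = sd(x_mean_imp)
)
```

The understated variability is one reason to prefer the model-based methods described next when the imputed data feed into inference.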

Advanced Imputation with R Packages

R provides several powerful packages for robust imputation, including:

Hmisc – General‑purpose imputation.
missForest – Non‑parametric imputation using Random Forest.
Amelia – Multiple imputation for time‑series and cross‑sectional data.
mice – Multivariate Imputation by Chained Equations (a widely used choice for MAR data).

We’ll focus on the mice package, which is widely used for MAR missing values and provides multiple imputed datasets for robust modeling.

Using the `mice` Package

Step 1: Load Packages and Data

library(mice)
library(VIM)
library(lattice)

data(nhanes)  # NHANES dataset
# NHANES contains 25 observations and 4 variables: age, bmi, hyp (hypertension), and chl (cholesterol).
# Several variables have missing values: bmi, hyp, and chl.

# Age is coded in bands (1, 2, 3) and better treated as a factor:
nhanes$age <- as.factor(nhanes$age)

Step 2: Understand Missing Patterns

md.pattern(nhanes)

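md.pattern() tabulates the missingness patterns. Each row is one pattern (1 = observed, 0 = missing), the row label gives the number of observations with that pattern, and the bottom row counts the missing values per variable. This shows which variables tend to be missing together.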
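
To see what such a pattern table contains, the same kind of tabulation can be sketched in base R. This is an illustration only, not how mice computes it, and the data frame df below is made up.

```r
# Hypothetical data frame with missing values in two of three columns.
df <- data.frame(
  a = c(1, 2, NA, 4, 5, NA),
  b = c(NA, 2, 3, 4, NA, NA),
  c = c(1, 2, 3, 4, 5, 6)
)

# 1 = observed, 0 = missing, the same coding md.pattern() prints.
observed <- (!is.na(df)) * 1L

# Collapse each row into a pattern string and count each pattern.
pattern <- apply(observed, 1, paste, collapse = "")
pattern_counts <- sort(table(pattern), decreasing = TRUE)
print(pattern_counts)

# Number of missing values per column.
missing_per_column <- colSums(is.na(df))
print(missing_per_column)
```

Here the pattern "111" (fully observed) and the pattern "101" (only b missing) each occur twice, and column b has the most gaps with three missing values.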

Visualizing Missing Data

nhanes_miss <- aggr(
  nhanes,
  col = mdc(1:2),
  numbers = TRUE,
  sortVars = TRUE,
  labels = names(nhanes),
  cex.axis = .7,
  gap = 3,
  ylab = c("Proportion of missingness", "Missingness Pattern")
)

marginplot(
  nhanes[, c("chl", "bmi")],
  col = mdc(1:2),
  cex.numbers = 1.2,
  pch = 19
)

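aggr() shows the proportion of missing values per variable and how often each combination of missing variables occurs.
marginplot() draws a scatter plot of the two variables with box plots in the margins that split each variable by whether the other one is missing. If the two box plots for a variable look similar, the missingness is consistent with MCAR.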

Step 3: Impute Missing Values

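mice_imputes <- mice(nhanes, m = 5, maxit = 40, seed = 123)
# m = 5      → creates 5 imputed datasets
# maxit = 40 → number of chained-equation iterations run for each imputation
# seed = 123 → arbitrary value, added here so the imputations are reproducible
# Default method for numeric variables: Predictive Mean Matching ("pmm")
mice_imputes$method  # method used per variable; "" means nothing to impute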
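
The idea behind Predictive Mean Matching can be sketched in a few lines of base R. This is a deliberately simplified, single-variable illustration and not the mice implementation, which among other things also draws the regression parameters at random. The function name pmm_once, the donor count k = 5, and the simulated data are assumptions.

```r
set.seed(1)

# Hypothetical data: y depends on x; 10 of 40 y values are set to missing.
n <- 40
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
y[sample(n, 10)] <- NA

# Simplified PMM: predict y for every case, then fill each missing y with the
# observed y of a randomly chosen donor among the k closest predictions.
pmm_once <- function(y, x, k = 5) {
  obs <- !is.na(y)
  fit <- lm(y ~ x)                                  # rows with NA are dropped
  pred <- predict(fit, newdata = data.frame(x = x)) # predictions for all rows
  y_imp <- y
  for (i in which(!obs)) {
    dist <- abs(pred[obs] - pred[i])
    donors <- which(obs)[order(dist)[seq_len(k)]]   # k nearest observed cases
    y_imp[i] <- y[donors[sample.int(k, 1)]]         # copy one donor's value
  }
  y_imp
}

y_imp <- pmm_once(y, x)
summary(y_imp)
```

Because every imputed value is copied from a real observation, the filled-in values always stay within the range of the observed data, which is one reason PMM is a common default for numeric variables.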

Step 4: Extract an Imputed Dataset

imputed_data <- complete(mice_imputes, 5)  # Using the 5th imputed dataset
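
Instead of picking one imputed dataset, it can be useful to look at all of them at once. The sketch below is self-contained and makes its own assumptions: it reruns mice on the unmodified nhanes data with fewer iterations, an arbitrary seed, and a new object name (imp) so that it runs quickly and reproducibly.

```r
library(mice)

# Assumption: a short, seeded run purely for illustration.
imp <- mice(nhanes, m = 5, maxit = 5, seed = 123, printFlag = FALSE)

# One completed dataset (the first of the five).
first_completed <- complete(imp, 1)

# All five completed datasets stacked in long format;
# .imp identifies the imputation and .id the original row.
all_completed <- complete(imp, action = "long")

# Mean bmi in each completed dataset: the spread across imputations
# gives a rough sense of how uncertain the imputed values are.
bmi_means <- tapply(all_completed$bmi, all_completed$.imp, mean)
print(bmi_means)
```

If the per-imputation summaries differ noticeably, an analysis based on a single completed dataset would hide that uncertainty.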

Step 5: Evaluate Imputation Quality

XY Plot

xyplot(mice_imputes, bmi ~ chl | .imp, pch = 20, cex = 1.4)

Blue points = observed data
Red points = imputed data

In a plausible imputation, the red points fall within the same range as the blue points and follow the same overall relationship between the two variables.

Density Plot

densityplot(mice_imputes)

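densityplot() overlays the density of the observed values (blue) with the density of the imputed values from each imputation (red). Similar shapes suggest the imputed values are plausible, while large differences deserve a closer look.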
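
The visual checks can be complemented with a quick numerical comparison of observed and imputed values. The following self-contained sketch assumes a short, seeded rerun of mice on the unmodified nhanes data. The object name imp and the settings are illustrative choices.

```r
library(mice)

# Assumption: a short, seeded run purely for illustration.
imp <- mice(nhanes, m = 5, maxit = 5, seed = 123, printFlag = FALSE)

# Observed bmi values from the original data.
observed_bmi <- nhanes$bmi[!is.na(nhanes$bmi)]

# imp$imp$bmi holds the imputed values: one row per missing case,
# one column per imputation.
imputed_bmi <- unlist(imp$imp$bmi)

# Side-by-side summaries of observed and imputed values.
print(summary(observed_bmi))
print(summary(imputed_bmi))
```

With the default PMM method, each imputed bmi is borrowed from an observed case, so the imputed summary should stay inside the observed range. With other methods this is not guaranteed, and values far outside the observed range would be a warning sign.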

Step 6: Modeling with Multiple Imputed Datasets

lm_5_model   <- with(mice_imputes, lm(chl ~ age + bmi + hyp))
combo_5_model <- pool(lm_5_model)
summary(combo_5_model)

with() fits the model on each imputed dataset, and pool() combines the results. The pooled standard errors account for the variation between imputations, so they are more trustworthy than those from a single imputed dataset.
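
pool() combines the per-dataset results using Rubin's rules. The arithmetic for a single coefficient can be sketched by hand in base R. The estimates and standard errors below are invented numbers for illustration, not output from the nhanes model.

```r
# Hypothetical results for one coefficient from m = 5 imputed datasets.
est <- c(1.9, 2.1, 2.0, 2.3, 1.7)       # point estimates
se  <- c(0.50, 0.48, 0.52, 0.49, 0.51)  # standard errors

m <- length(est)

pooled_est  <- mean(est)   # pooled estimate: average of the m estimates
within_var  <- mean(se^2)  # average within-imputation variance
between_var <- var(est)    # variance of the estimates across imputations

# Total variance: within + between, with a correction for finite m.
total_var <- within_var + (1 + 1 / m) * between_var
pooled_se <- sqrt(total_var)

c(estimate = pooled_est, std_error = pooled_se)
```

The pooled standard error is larger than the average single-dataset standard error whenever the estimates differ across imputations. That extra term is exactly the uncertainty a single imputed dataset would ignore.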

Summary

Missing values are a common challenge in data analysis.
Depending on the type and amount of missingness, they can be ignored, dropped, or imputed.
R packages such as mice, Hmisc, Amelia, and missForest offer advanced imputation methods.
The mice package is particularly powerful for MAR missing values, allowing multiple imputations and robust modeling.
Visualizing missingness with VIM helps assess the nature of the missing values and guides the choice of imputation method.

Careful handling of missing data reduces bias and makes models more reliable, which is a cornerstone of successful data‑science projects.
