Sharpening the Axe: Performing Principal Component Analysis (PCA) in R for Modern Machine Learning
“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.”
— Abraham Lincoln
This quote resonates strongly with modern machine learning and data science. In real‑world projects, the majority of time is not spent on modeling, but on data preprocessing, feature engineering, and dimensionality reduction.
One of the most powerful and widely used dimensionality‑reduction techniques is Principal Component Analysis (PCA). PCA helps us transform high‑dimensional data into a smaller, more informative feature space—often improving model performance, interpretability, and computational efficiency.
In this article you will learn:
- The conceptual foundations of PCA
- How to implement PCA in R using modern, industry‑standard practices
Table of Contents
- Lifting the Curse with Principal Component Analysis
- Curse of Dimensionality in Simple Terms
- Shlens’ Perspective on PCA
- Conceptual Background of PCA
- Implementing PCA in R (Modern Approach)
- Loading and Preparing the Iris Dataset
- Scaling and Standardization
- Covariance Matrix and Eigen Decomposition
- Performing PCA with prcomp()
- Understanding PCA Outputs
- Variance Explained
- Loadings and Scores
- Scree Plot and Biplot
- PCA in a Modeling Workflow (Naïve Bayes Example)
- Summary and Practical Takeaways
Lifting the Curse with Principal Component Analysis
A common myth in analytics is:
“More features and more data will always improve model accuracy.”
In practice, this is often false. When the number of features is large relative to the number of observations, models become:
- Unstable
- Harder to generalize
- Prone to over‑fitting
This phenomenon is known as the curse of dimensionality. PCA helps address it by reducing dimensionality while preserving most of the informational content.
Curse of Dimensionality in Simple Terms
- Adding more features can decrease model accuracy.
- The volume of the feature space grows exponentially with the number of features, so the available data becomes increasingly sparse.
- Distance‑based and probabilistic models degrade rapidly in sparse, high‑dimensional spaces.
Two general ways to mitigate the curse:
- Collect more data – often expensive or impossible.
- Reduce the number of features – the preferred, practical approach.
Dimensionality‑reduction techniques like PCA fall into the second category.
Shlens’ Perspective on PCA
In his well‑known paper, Jonathon Shlens describes PCA using a simple analogy: observing the motion of a pendulum.
- If the pendulum moves in one direction but we don’t know that direction, we may need several cameras (features) to capture its motion.
- PCA rotates the coordinate system so that we capture the motion with fewer, orthogonal views.
In essence, PCA:
- Transforms correlated variables into uncorrelated (orthogonal) components.
- Orders these components by variance explained.
- Allows us to retain only the most informative components (see the short sketch after this list).
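To make these three properties concrete, here is a minimal sketch on simulated data (the simulation and variable names are illustrative only, not part of the iris example used later): two correlated inputs become two uncorrelated components, ordered by the variance they explain.
# Minimal sketch: two correlated variables -> two uncorrelated components
set.seed(42)
x1  <- rnorm(200)
x2  <- 0.8 * x1 + rnorm(200, sd = 0.3)   # x2 is strongly correlated with x1
toy <- cbind(x1, x2)
pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)
round(cor(toy), 2)        # original variables: clearly correlated
round(cor(pca_toy$x), 2)  # component scores: off-diagonal correlations ~0
pca_toy$sdev^2            # component variances, largest first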
Conceptual Background of PCA
Assume a dataset with:
- m observations
- n features
Represented as an m × n matrix A.
PCA transforms A into a new matrix A′ of size m × k, where k ≤ n. The transformation is based on the eigen decomposition of the covariance matrix (or on a singular value decomposition).
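As a rough sketch of this transformation (using a small random matrix purely for illustration; m, n, k and A follow the notation above):
# Sketch: project an m x n matrix A onto its first k principal axes
set.seed(1)
m <- 100; n <- 5; k <- 2
A <- matrix(rnorm(m * n), nrow = m, ncol = n)
A_centered <- scale(A, center = TRUE, scale = FALSE)  # centre each column
V <- eigen(cov(A_centered))$vectors                   # principal axes (n x n)
A_prime <- A_centered %*% V[, 1:k]                    # projected data (m x k)
dim(A_prime)                                          # 100 x 2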
Implementing PCA in R (Modern Approach)
Loading and Preparing the Iris Dataset
data(iris)
df <- iris[, 1:4] # use only numeric features
head(df)
Scaling and Standardization
df_scaled <- scale(df) # zero‑mean, unit‑variance scaling
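An optional sanity check confirms that every column now has (near‑)zero mean and unit standard deviation:
round(colMeans(df_scaled), 10)   # all means ~0
apply(df_scaled, 2, sd)          # all standard deviations equal 1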
Covariance Matrix and Eigen Decomposition
cov_mat <- cov(df_scaled)
eigen_data <- eigen(cov_mat)
# eigen_data$values -> eigenvalues (variance explained)
# eigen_data$vectors -> eigenvectors (principal axes)
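Because each eigenvalue measures the variance along its eigenvector, dividing by their sum gives the proportion of variance each principal axis explains:
# Proportion of total variance captured by each principal axis
eigen_data$values / sum(eigen_data$values)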
Performing PCA with prcomp()
# Why prcomp()?
# • Uses singular value decomposition (SVD)
# • Numerically more stable
# • Works better for high‑dimensional data
pca_res <- prcomp(df_scaled, center = FALSE, scale. = FALSE)  # df_scaled is already centred and scaled
summary(pca_res)
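As an optional cross‑check (not part of the core walkthrough), the SVD‑based results should agree with the manual eigen decomposition: the squared standard deviations equal the eigenvalues, and the rotation matrix matches the eigenvectors up to sign.
# Cross-check prcomp() against the manual eigen decomposition
all.equal(pca_res$sdev^2, eigen_data$values)                # TRUE
round(abs(pca_res$rotation) - abs(eigen_data$vectors), 10)  # ~0 (signs may flip)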
Understanding PCA Outputs
Variance Explained
explained_variance <- pca_res$sdev^2 / sum(pca_res$sdev^2)
explained_variance
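The cumulative proportion is usually more convenient when deciding how many components to keep (for example, enough to cover roughly 95% of the variance):
# Cumulative proportion of variance explained
cumsum(explained_variance)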
Loadings and Scores
loadings <- pca_res$rotation
scores <- pca_res$x
head(loadings)
head(scores)
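A helpful way to internalize what these objects mean: the scores are the data re‑expressed in the principal‑component basis, so multiplying them by the transposed loadings reconstructs the scaled data exactly when all components are kept. This is only a sanity check, not a step you need in practice.
# Reconstruct the scaled data from scores and loadings
reconstructed <- scores %*% t(loadings)
all.equal(reconstructed, df_scaled, check.attributes = FALSE)   # TRUE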
Scree Plot and Biplot
# Scree plot
plot(explained_variance, type = "b",
     xlab = "Principal Component",
     ylab = "Proportion of Variance Explained")
# Biplot
biplot(pca_res)
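If you prefer ggplot2‑style graphics, the factoextra package provides convenient wrappers; it is not used elsewhere in this article, so treat this as an optional alternative to the base‑R plots above.
# Optional: ggplot2-based scree plot and biplot via factoextra
# install.packages("factoextra")
library(factoextra)
fviz_eig(pca_res)          # scree plot
fviz_pca_biplot(pca_res)   # biplot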
PCA in a Modeling Workflow (Naïve Bayes Example)
- Split the data into training and test sets.
- Apply prcomp() on the training set and retain the top k components.
- Transform both training and test sets using the same rotation matrix.
- Train a Naïve Bayes classifier on the reduced‑dimensional training data.
- Evaluate performance on the test set.
library(e1071) # for Naïve Bayes
set.seed(123)
train_idx <- sample(seq_len(nrow(df_scaled)), size = floor(0.7 * nrow(df_scaled)))
train_data <- df_scaled[train_idx, ]
test_data <- df_scaled[-train_idx, ]
pca_train <- prcomp(train_data)
k <- 2 # keep first two PCs
train_pc <- predict(pca_train, train_data)[, 1:k]
test_pc <- predict(pca_train, test_data)[, 1:k]
nb_model <- naiveBayes(train_pc, iris$Species[train_idx])
pred <- predict(nb_model, test_pc)
confusionMatrix <- table(Predicted = pred, Actual = iris$Species[-train_idx])
confusionMatrix
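Overall accuracy can be read directly off the confusion matrix:
# Share of correctly classified test observations
sum(diag(confusionMatrix)) / sum(confusionMatrix)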
Summary and Practical Takeaways
- PCA is a cornerstone technique for tackling the curse of dimensionality.
- Proper scaling of features is essential before applying PCA.
- prcomp() (SVD‑based) is the preferred R function for robust PCA.
- Examine variance explained to decide how many components to retain.
- Integrate PCA early in the modeling pipeline to improve speed and generalization.
By sharpening the “axe” of your data—through careful preprocessing and dimensionality reduction—you set the stage for more reliable, interpretable, and efficient machine‑learning models.