Ridge Regression vs Lasso Regression

Published: February 3, 2026 at 03:02 PM EST
4 min read
Source: Dev.to

Introduction

Linear regression is one of the most fundamental tools in a data scientist’s toolkit. At its core lies Ordinary Least Squares (OLS), a method that estimates model parameters by minimizing the sum of squared differences between predicted and actual values.

In many real‑world problems—such as house‑price prediction—datasets often contain many features, correlated variables, and noisy inputs. In such cases, traditional OLS regression becomes unstable and prone to over‑fitting. To address these challenges, regularisation techniques are used. The two most important regularisation‑based models are:

  • Ridge Regression (L2 regularisation)
  • Lasso Regression (L1 regularisation)

Ordinary Least Squares (OLS)

OLS estimates model parameters by minimising the sum of squared residuals between predicted and actual values:

[ \text{Loss}_{\text{OLS}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 ]

where (\hat{y}_i) represents the predicted price for observation (i).

OLS works well for small, clean datasets, but it struggles when:

  • There are many features
  • Features are highly correlated (multicollinearity)
  • Data contains noise

These situations lead to over‑fitting: the model performs well on training data but poorly on unseen data.

Regularisation in Linear Regression

Regularisation adds a penalty term to the loss function, charging the model for complexity. The model must now balance accuracy against simplicity rather than merely minimising error.

[ \text{Loss} = \text{Error} + \text{Penalty} ]

Large coefficients are discouraged, which typically yields models that generalise better to new data.
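The split between error and penalty can be computed by hand. A minimal NumPy sketch with made-up prices and coefficients (all numbers illustrative):

```python
import numpy as np

# Illustrative values: actual vs predicted prices and slope coefficients
y_true = np.array([200_000.0, 310_000.0, 255_000.0])
y_pred = np.array([210_000.0, 300_000.0, 250_000.0])
beta = np.array([1.5, -0.3, 0.8])   # slope coefficients (intercept excluded)
lam = 0.5                           # regularisation strength λ

rss = np.sum((y_true - y_pred) ** 2)      # the error term (OLS loss)
l2_penalty = lam * np.sum(beta ** 2)      # Ridge adds this
l1_penalty = lam * np.sum(np.abs(beta))   # Lasso adds this

ridge_loss = rss + l2_penalty
lasso_loss = rss + l1_penalty
```

Both regularised losses are strictly larger than the plain RSS, so any reduction in coefficient size that barely hurts the fit is now worthwhile.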

Ridge Regression (L2 Regularisation)

Ridge regression modifies the OLS loss function by adding an L2 penalty proportional to the sum of squared coefficients.

[ \text{Loss}_{\text{Ridge}} = \underbrace{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}_{\text{RSS}} + \lambda \underbrace{\sum_{j=1}^{p}\beta_j^{2}}_{\text{L2 penalty}} ]

  • (\lambda \ge 0) is the regularisation parameter.
  • The intercept (\beta_0) is not penalised.

Conceptual Effect

  • Shrinks coefficients smoothly
  • Reduces model variance
  • Keeps all features
  • Handles multicollinearity well

Key Property

Ridge does not perform feature selection; coefficients are reduced but never become exactly zero.

Python Example

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)          # alpha == λ
ridge.fit(X_train_scaled, y_train)

y_pred_ridge = ridge.predict(X_test_scaled)
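To see the "shrinks but never zeroes" behaviour, one can fit Ridge at increasing alpha values on synthetic data (the data and alpha grid below are illustrative, not from the house-price example):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_beta = np.array([3.0, 1.5, 0.0, 0.5, 2.0])
y = X @ true_beta + rng.normal(scale=0.5, size=100)

norms = {}
for alpha in (0.01, 1.0, 100.0):
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    # total coefficient magnitude shrinks as alpha grows
    norms[alpha] = float(np.abs(coef).sum())
```

The summed magnitude drops steadily with alpha, yet every coefficient stays nonzero, including the one whose true value is 0.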

Lasso Regression (L1 Regularisation)

Lasso adds an L1 penalty, which is the sum of the absolute values of the coefficients.

[ \text{Loss}_{\text{Lasso}} = \underbrace{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}_{\text{RSS}} + \lambda \underbrace{\sum_{j=1}^{p}|\beta_j|}_{\text{L1 penalty}} ]

  • (\lambda) controls the strength of regularisation.

Conceptual Effect

  • Creates sparse models
  • Forces some coefficients to be exactly zero
  • Automatically removes weak features

Key Property

Lasso performs feature selection, producing simpler and more interpretable models.

Python Example

from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)          # alpha == λ
lasso.fit(X_train_scaled, y_train)

y_pred_lasso = lasso.predict(X_test_scaled)
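Lasso's sparsity is easy to verify: fit it on synthetic data where only three of ten features carry signal, and count the zeroed coefficients (the data and alpha are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first three features actually drive the target
y = 4.0 * X[:, 0] + 2.5 * X[:, 1] - 3.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0.0))   # weak features are dropped entirely
```

Most of the seven noise features get coefficients of exactly 0.0, while the three informative ones survive with only modest shrinkage.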

Comparing Ridge and Lasso

  • Feature selection: Ridge retains all features (coefficients are shrunk); Lasso sets some coefficients exactly to zero, giving automatic selection.
  • Behaviour with correlated features: Ridge distributes weight smoothly among correlated predictors; Lasso tends to pick one predictor and zero out the others.
  • Interpretability: Ridge says "Price depends on all 10 factors with varying importance"; Lasso says "Price primarily depends on size, location, and age; other factors don't matter."

Example with two correlated predictors (size and number of rooms, (r = 0.85)):

  • Ridge: Size = $120/sq ft, Rooms = $8,000/room (both retained)
  • Lasso: Size = $180/sq ft, Rooms = $0 (chooses one, drops the other)
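This behaviour can be reproduced on synthetic data (unitless standardised features rather than dollar amounts, and the alpha values are illustrative). Here rooms is correlated with size at r ≈ 0.85 but adds no independent signal of its own:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(42)
n = 300
size = rng.normal(size=n)
# rooms correlates with size at r ≈ 0.85
rooms = 0.85 * size + np.sqrt(1 - 0.85**2) * rng.normal(size=n)
X = np.column_stack([size, rooms])
y = 3.0 * size + rng.normal(scale=0.1, size=n)

ridge_coef = Ridge(alpha=30.0).fit(X, y).coef_  # weight spread across both features
lasso_coef = Lasso(alpha=1.0).fit(X, y).coef_   # rooms driven exactly to zero
```

Ridge assigns rooms a sizeable positive coefficient purely because of the correlation, while Lasso keeps size and drops rooms entirely.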

Application Scenario: House‑Price Prediction

Assume the dataset contains:

  • House size
  • Number of bedrooms
  • Distance to the city centre
  • Number of nearby schools
  • Several noisy or weak features

When to use Ridge

  • Most features are expected to influence price
  • Multicollinearity is present
  • You need stable predictions

When to use Lasso

  • Only a few features truly matter
  • Many variables add noise
  • Model interpretability is important

Python Implementation

Data Preparation

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error

# Assume df is a pandas DataFrame containing the data
X = df[['size', 'bedrooms', 'distance_city', 'schools_nearby', 'noise_feature']]
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

OLS Model

ols = LinearRegression()
ols.fit(X_train_scaled, y_train)

y_pred_ols = ols.predict(X_test_scaled)
mse_ols = mean_squared_error(y_test, y_pred_ols)
print(f'OLS MSE: {mse_ols:.2f}')

Ridge Model

ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)

y_pred_ridge = ridge.predict(X_test_scaled)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
print(f'Ridge MSE: {mse_ridge:.2f}')

Lasso Model

lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)

y_pred_lasso = lasso.predict(X_test_scaled)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
print(f'Lasso MSE: {mse_lasso:.2f}')
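In practice the regularisation strength is rarely fixed by hand; scikit-learn's RidgeCV and LassoCV select alpha by cross-validation. A sketch on synthetic data (the alpha grid is illustrative):

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.0, 1.5, 0.0, 2.0]) + rng.normal(scale=1.0, size=200)

alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
ridge_cv = RidgeCV(alphas=alphas).fit(X, y)        # leave-one-out CV by default
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)  # 5-fold CV

best_ridge_alpha = ridge_cv.alpha_
best_lasso_alpha = lasso_cv.alpha_
```

Each estimator refits on the full data with the winning alpha, so it can be used directly for prediction afterwards.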

Choosing the Right Model for House Prices

  • Ridge Regression – preferred when all features contribute meaningfully (e.g., size, bedrooms, schools, distance).
  • Lasso Regression – more suitable when only a few features are truly important and the rest add noise, thanks to its built‑in feature‑selection capability.

Model Evaluation and Overfitting Detection

Overfitting can be detected by comparing training and testing performance:

  • High training score but low test score → overfitting.
  • Similar training and test scores → good generalisation.
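The comparison takes only a few lines. A sketch on synthetic data (a well-specified linear model, so the gap should be small):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=150)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = Ridge(alpha=1.0).fit(X_tr, y_tr)

train_r2 = model.score(X_tr, y_tr)   # R² on data the model has seen
test_r2 = model.score(X_te, y_te)    # R² on held-out data
gap = train_r2 - test_r2             # a large gap signals overfitting
```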

Residual analysis also plays a key role. Residuals should be randomly distributed; visible patterns may indicate missing variables or non‑linear relationships.

Conclusion

  • OLS is simple but prone to overfitting in complex datasets.
  • Ridge and Lasso regression introduce regularisation to improve stability and generalisation.
    • Ridge is best when all features matter.
    • Lasso is preferred for sparse, interpretable models.

Understanding when and how to apply these techniques is essential for both exams and real‑world machine‑learning problems.
