Ridge Regression vs Lasso Regression
Source: Dev.to
Introduction
Linear regression is one of the most fundamental tools in a data scientist’s toolkit. At its core lies Ordinary Least Squares (OLS), a method that estimates model parameters by minimizing the sum of squared differences between predicted and actual values.
In many real-world problems, such as house-price prediction, datasets often contain many features, correlated variables, and noisy inputs. In such cases, traditional OLS regression becomes unstable and prone to over-fitting. Regularisation techniques address these challenges. The two most widely used regularisation-based models are:
- Ridge Regression (L2 regularisation)
- Lasso Regression (L1 regularisation)
Ordinary Least Squares (OLS)
OLS estimates model parameters by minimising the sum of squared residuals between predicted and actual values:
[ \text{Loss}_{\text{OLS}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 ]
where (\hat{y}_i) represents the predicted price for observation (i).
OLS works well for small, clean datasets, but it struggles when:
- There are many features
- Features are highly correlated (multicollinearity)
- Data contains noise
These situations lead to over‑fitting: the model performs well on training data but poorly on unseen data.
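This instability is easy to reproduce on toy data (the data-generating process below is an illustrative assumption, not from the article): when two features are nearly identical copies of each other, the individual OLS coefficients become unreliable even though their combined effect is recovered.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # almost a copy of x1: severe multicollinearity
y = 3 * x1 + rng.normal(scale=0.1, size=n)  # true effect comes from x1 only

X = np.column_stack([x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# The two coefficients are individually unstable (they can be large and
# opposite-signed), but their sum still approximates the true effect (~3).
print(coef, coef.sum())
```

Re-running with a different random seed typically produces very different individual coefficients, which is exactly the instability regularisation is meant to tame.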
Regularisation in Linear Regression
Regularisation adds a penalty term to the loss function, charging the model for complexity. The model must now balance accuracy against simplicity rather than merely minimising error.
[ \text{Loss} = \text{Error} + \text{Penalty} ]
Large coefficients are discouraged, which typically yields models that generalise better to new data.
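In code, this trade-off is literally the error term plus a penalty term. A minimal sketch (the helper name `penalized_loss` and the toy numbers are illustrative):

```python
import numpy as np

def penalized_loss(y, y_hat, beta, lam, norm="l2"):
    """Regularised loss: residual sum of squares plus a coefficient penalty."""
    rss = np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2)
    beta = np.asarray(beta, dtype=float)
    penalty = np.sum(beta ** 2) if norm == "l2" else np.sum(np.abs(beta))
    return rss + lam * penalty

# Same predictions, same error -- but large coefficients pay a higher L2 price.
print(penalized_loss([1, 2], [1, 1], beta=[2, -3], lam=1.0, norm="l2"))  # 1 + 13 = 14.0
print(penalized_loss([1, 2], [1, 1], beta=[2, -3], lam=1.0, norm="l1"))  # 1 + 5 = 6.0
```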
Ridge Regression (L2 Regularisation)
Ridge regression modifies the OLS loss function by adding an L2 penalty proportional to the sum of squared coefficients.
[ \text{Loss}_{\text{Ridge}} = \underbrace{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}_{\text{RSS}} \;+\; \lambda \underbrace{\sum_{j=1}^{p}\beta_j^{2}}_{\text{L2 penalty}} ]
- (\lambda \ge 0) is the regularisation parameter.
- The intercept (\beta_0) is not penalised.
Conceptual Effect
- Shrinks coefficients smoothly
- Reduces model variance
- Keeps all features
- Handles multicollinearity well
Key Property
Ridge does not perform feature selection; coefficients are reduced but never become exactly zero.
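This property can be checked directly on synthetic data (the true coefficients, two of which are exactly zero, and the deliberately strong penalty are assumptions for the demo): even then, Ridge shrinks every coefficient but zeroes none.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
true_beta = np.array([3.0, 0.0, 1.5, 0.0, -2.0])  # two features are truly irrelevant
y = X @ true_beta + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=100.0).fit(X, y)  # deliberately strong penalty
# All five coefficients are shrunk toward zero, but none becomes exactly zero.
print(ridge.coef_)
```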
Python Example
```python
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)  # alpha corresponds to λ
ridge.fit(X_train_scaled, y_train)
y_pred_ridge = ridge.predict(X_test_scaled)
```
Lasso Regression (L1 Regularisation)
Lasso adds an L1 penalty, which is the sum of the absolute values of the coefficients.
[ \text{Loss}_{\text{Lasso}} = \underbrace{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}_{\text{RSS}} \;+\; \lambda \underbrace{\sum_{j=1}^{p}|\beta_j|}_{\text{L1 penalty}} ]
- (\lambda) controls the strength of regularisation.
Conceptual Effect
- Creates sparse models
- Forces some coefficients to be exactly zero
- Automatically removes weak features
Key Property
Lasso performs feature selection, producing simpler and more interpretable models.
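The same synthetic setup used above makes this visible (true coefficients and penalty strength are illustrative assumptions): the truly irrelevant features end up with coefficients of exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
true_beta = np.array([3.0, 0.0, 1.5, 0.0, -2.0])  # two features are truly irrelevant
y = X @ true_beta + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)
# The irrelevant features' coefficients are driven to exactly zero;
# the strong features survive, shrunk in magnitude.
print(lasso.coef_)
```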
Python Example
```python
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)  # alpha corresponds to λ
lasso.fit(X_train_scaled, y_train)
y_pred_lasso = lasso.predict(X_test_scaled)
```
Comparing Ridge and Lasso
| Aspect | Ridge | Lasso |
|---|---|---|
| Feature selection | Retains all features (coefficients are shrunken) | Sets some coefficients to zero → automatic selection |
| Behaviour with correlated features | Distributes weight smoothly among correlated predictors | Picks one predictor, zeroes out the others |
| Interpretability | “Price depends on all 10 factors with varying importance.” | “Price primarily depends on size, location, and age; other factors don’t matter.” |
Example with two correlated predictors (size and number of rooms, (r = 0.85)):
- Ridge: Size = $120/sq ft, Rooms = $8,000/room (both retained)
- Lasso: Size = $180/sq ft, Rooms = $0 (chooses one, drops the other)
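The contrast can be reproduced on toy data. Everything below is an illustrative assumption (it does not reproduce the dollar figures above): size is taken as the true price driver, rooms is constructed to correlate with it at roughly 0.85, and the penalty strengths are picked for demonstration.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n = 200
size = rng.normal(size=n)
rooms = 0.85 * size + np.sqrt(1 - 0.85**2) * rng.normal(size=n)  # corr(size, rooms) ≈ 0.85
price = 4.0 * size + rng.normal(scale=0.5, size=n)               # price truly driven by size

X = np.column_stack([size, rooms])
ridge = Ridge(alpha=50.0).fit(X, price)
lasso = Lasso(alpha=1.0).fit(X, price)

print("Ridge:", ridge.coef_)  # weight spread across both correlated predictors
print("Lasso:", lasso.coef_)  # the redundant predictor is driven to (or very near) zero
```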
Application Scenario: House‑Price Prediction
Assume the dataset contains:
- House size
- Number of bedrooms
- Distance to the city centre
- Number of nearby schools
- Several noisy or weak features
When to use Ridge
- Most features are expected to influence price
- Multicollinearity is present
- You need stable predictions
When to use Lasso
- Only a few features truly matter
- Many variables add noise
- Model interpretability is important
Python Implementation
Data Preparation
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error

# Assume df is a pandas DataFrame containing the data
X = df[['size', 'bedrooms', 'distance_city', 'schools_nearby', 'noise_feature']]
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
OLS Model
```python
ols = LinearRegression()
ols.fit(X_train_scaled, y_train)
y_pred_ols = ols.predict(X_test_scaled)
mse_ols = mean_squared_error(y_test, y_pred_ols)
print(f'OLS MSE: {mse_ols:.2f}')
```
Ridge Model
```python
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
y_pred_ridge = ridge.predict(X_test_scaled)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
print(f'Ridge MSE: {mse_ridge:.2f}')
```
Lasso Model
```python
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)
y_pred_lasso = lasso.predict(X_test_scaled)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
print(f'Lasso MSE: {mse_lasso:.2f}')
```
Choosing the Right Model for House Prices
- Ridge Regression – preferred when all features contribute meaningfully (e.g., size, bedrooms, schools, distance).
- Lasso Regression – more suitable when only a few features are truly important and the rest add noise, thanks to its built‑in feature‑selection capability.
Model Evaluation and Overfitting Detection
Overfitting can be detected by comparing training and testing performance:
- High training score but low test score → overfitting.
- Similar training and test scores → good generalisation.
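A sketch of this check on synthetic data (the sine-wave target, the deliberately over-flexible degree-15 polynomial, and the penalty strength are all illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(40, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.2, size=40)

# A degree-15 polynomial is far more flexible than 40 noisy points warrant.
X_poly = PolynomialFeatures(degree=15, include_bias=False).fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_poly, y, test_size=0.5, random_state=0)

ols = LinearRegression().fit(X_tr, y_tr)
ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)

# A large train/test gap signals overfitting; regularisation usually narrows it.
print(f"OLS   R² train={ols.score(X_tr, y_tr):.3f}  test={ols.score(X_te, y_te):.3f}")
print(f"Ridge R² train={ridge.score(X_tr, y_tr):.3f}  test={ridge.score(X_te, y_te):.3f}")
```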
Residual analysis also plays a key role. Residuals should be randomly distributed; visible patterns may indicate missing variables or non‑linear relationships.
Conclusion
- OLS is simple but prone to overfitting in complex datasets.
- Ridge and Lasso regression introduce regularisation to improve stability and generalisation.
- Ridge is best when all features matter.
- Lasso is preferred for sparse, interpretable models.
Understanding when and how to apply these techniques is essential for both exams and real‑world machine‑learning problems.