1. Introduction to Regularization
Regularization is a technique to prevent overfitting in linear regression by adding a penalty term to the cost function. This penalty discourages large coefficients and helps improve the model's generalization.
Two commonly used regularization techniques are:
Ridge Regression (L2 Regularization)
Lasso Regression (L1 Regularization)
Lasso and Ridge regression are regularization techniques used to improve the performance of linear regression models by addressing overfitting and multicollinearity. They differ in how they penalize model coefficients, making them suitable for different scenarios.
When to Use Lasso Regression?
Lasso regression, or L1 regularization, is ideal when you suspect that only a subset of predictors is important. It adds a penalty based on the absolute values of the coefficients, which can shrink some coefficients to exactly zero. This makes Lasso effective for feature selection, as it automatically excludes irrelevant predictors.
For example, in high-dimensional datasets like genetic studies, where only a few genes out of thousands are relevant, Lasso helps identify the most impactful features while ignoring the rest. It is best suited when you aim to simplify the model by retaining only the most significant predictors.
When to Use Ridge Regression?
Ridge regression, or L2 regularization, is more appropriate when all predictors are potentially relevant. It adds a penalty proportional to the square of the coefficients, shrinking them towards zero but not eliminating any. This ensures that all features contribute to the model, albeit with reduced influence, which helps mitigate overfitting.
For instance, in predicting house prices, where features like size, location, and number of bedrooms are all relevant, Ridge regression ensures that no feature is excluded while controlling the magnitude of their coefficients. It is particularly useful when multicollinearity exists among predictors.
Key Differences
Lasso performs feature selection by setting some coefficients to zero, making it suitable for sparse models. Ridge, on the other hand, retains all predictors and is better for scenarios where all features are important. Ridge also has a closed-form solution and is typically faster to fit, while Lasso relies on iterative optimization (e.g., coordinate descent) and may take longer.
By understanding the nature of your data and the importance of predictors, you can choose between Lasso and Ridge regression to optimize model performance.
2. Standard Linear Regression Model
For a linear regression model:

[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon ]
Where:
( y ) = Target variable
( x_j ) = Feature variables
( \beta_j ) = Coefficients
( \epsilon ) = Error term
The Ordinary Least Squares (OLS) cost function is:
[ J(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left( y_i - \left( \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \right) \right)^2 ]
3. Ridge Regression (L2 Regularization)
Ridge Cost Function:

[ J(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 ]
Where:
The second term is the L2 penalty (sum of squared coefficients).
( \lambda \geq 0 ) is the regularization parameter.
Effect of ( \lambda ):
If ( \lambda = 0 ): Ridge becomes OLS (no regularization).
If ( \lambda ) is large: Coefficients shrink towards zero but never exactly zero.
4. Lasso Regression (L1 Regularization)
Lasso Cost Function:

[ J(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j| ]
The penalty term is the L1 norm of the coefficients (absolute values).
Encourages some coefficients to become exactly zero → performs feature selection.
5. Comparison: Ridge vs Lasso
Aspect | Ridge (L2) | Lasso (L1) |
---|---|---|
Penalty | ( \sum \beta_j^2 ) | ( \sum \|\beta_j\| ) |
Shrinks Coefficients | Yes | Yes |
Sets Coefficients to Zero | No | Yes (sparse model) |
Use Case | Multicollinearity handling | Feature selection |
6. Regularization Path Behavior
Ridge: Coefficients gradually shrink but remain non-zero.
Lasso: Coefficients shrink and some become exactly zero as ( \lambda ) increases.
7. Key Hyperparameter
( \lambda ) (or alpha in scikit-learn) controls the strength of the regularization.
Higher ( \lambda ): Stronger regularization → smaller coefficients.
Lower ( \lambda ): Weaker regularization → behaves like OLS.
8. Geometric Interpretation
Ridge: Constrains coefficients within a circle (L2 norm ball).
Lasso: Constrains coefficients within a diamond (L1 norm ball), leading to sparsity.
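As a quick illustration (not part of the original notebook), the following matplotlib sketch draws the two constraint regions in 2D; the corners of the L1 diamond lie on the axes, which is why Lasso solutions often have exact zeros.

```python
# Sketch: draw the L2 (circle) and L1 (diamond) constraint regions in 2D.
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2 * np.pi, 400)
plt.plot(np.cos(theta), np.sin(theta), label="L2 ball (Ridge)")

# The L1 ball |b1| + |b2| = 1 is a diamond with corners on the axes,
# so the constrained optimum often lands on an axis (a zero coefficient).
diamond = np.array([[1, 0], [0, 1], [-1, 0], [0, -1], [1, 0]])
plt.plot(diamond[:, 0], diamond[:, 1], label="L1 ball (Lasso)")

plt.gca().set_aspect("equal")
plt.axhline(0, color="gray", lw=0.5)
plt.axvline(0, color="gray", lw=0.5)
plt.legend()
plt.title("L1 vs L2 constraint regions")
plt.show()
```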
9. When to Use?
Ridge: When all features are relevant and we want to reduce overfitting.
Lasso: When we need automatic feature selection and a sparse model.
10. Combined Approach: Elastic Net
Elastic Net combines L1 and L2 penalties:
[ J(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda_1 \sum |\beta_j| + \lambda_2 \sum \beta_j^2 ]
This balances feature selection (Lasso) and coefficient shrinkage (Ridge).
Simple example using Python
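The original code cell is not reproduced here; the sketch below, built on synthetic data from make_regression (an assumption made purely for illustration), shows the kind of comparison the observations that follow refer to.

```python
# Sketch (not the notebook's original cell): compare Ridge and Lasso
# coefficients as the regularization strength alpha increases.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Synthetic data with only a few informative features (illustrative assumption).
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=10.0, random_state=42)

for alpha in [0.01, 1.0, 10.0, 100.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    print(f"alpha = {alpha}")
    print("  Ridge:", np.round(ridge.coef_, 2))  # shrink, never exactly zero
    print("  Lasso:", np.round(lasso.coef_, 2))  # some become exactly zero
```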
Observations:
Ridge: Coefficients shrink gradually as alpha increases, but none become exactly zero.
Lasso: Coefficients shrink and some become exactly zero as alpha increases → feature selection.
Example using Dataset
Features and target
Splitting the data prevents data leakage and helps evaluate performance on unseen data.
StandardScaler standardizes features (mean=0, std=1) — essential for regularized models to treat all features fairly.
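A minimal sketch of this preprocessing step, assuming a house-price CSV named house_prices.csv with a price column (the file and column names are placeholders, not the notebook's actual data):

```python
# Sketch: load the data, separate features/target, split, and standardize.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder file and column names; substitute the actual dataset.
h_price = pd.read_csv("house_prices.csv")
X = h_price.drop(columns=["price"])
y = h_price["price"]

# Hold out a test set so evaluation happens on unseen data (avoids leakage).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then transform both splits.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```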
Ridge regression
Ridge adds L2 penalty (squares of coefficients) to reduce model complexity.
Good for multicollinearity (when features are correlated).
Helps prevent overfitting by shrinking large coefficients.
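A minimal Ridge sketch, continuing from the scaled splits above (alpha=1.0 is an arbitrary starting value, tuned later by the grid search):

```python
# Sketch: fit Ridge on the scaled training data and evaluate on the test set.
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

ridge = Ridge(alpha=1.0)          # placeholder alpha; tuned later
ridge.fit(X_train_scaled, y_train)

print("Ridge test R^2:", r2_score(y_test, ridge.predict(X_test_scaled)))
print("Ridge coefficients:", ridge.coef_)
```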
Lasso
Lasso adds L1 penalty (absolute values of coefficients).
Drives some coefficients to zero → helps with feature selection.
Best when you suspect some features may be irrelevant.
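A corresponding Lasso sketch (alpha=0.1 is again a placeholder value):

```python
# Sketch: fit Lasso and check how many coefficients were driven to zero.
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

lasso = Lasso(alpha=0.1, max_iter=10_000)   # placeholder alpha; tuned later
lasso.fit(X_train_scaled, y_train)

print("Lasso test R^2:", r2_score(y_test, lasso.predict(X_test_scaled)))
print("Zeroed coefficients:", (lasso.coef_ == 0).sum())
```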
ElasticNet
Combines both L1 and L2 penalties (Ridge + Lasso).
Useful when you expect both feature selection and coefficient shrinkage are needed.
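A corresponding ElasticNet sketch (the alpha and l1_ratio values are placeholders):

```python
# Sketch: fit ElasticNet, which mixes the L1 and L2 penalties.
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score

# l1_ratio=0.5 gives an even L1/L2 mix; both values are placeholders.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10_000)
enet.fit(X_train_scaled, y_train)

print("ElasticNet test R^2:", r2_score(y_test, enet.predict(X_test_scaled)))
```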
Visualize the magnitude and sparsity of model coefficients.
Lasso shows zero values for less important features.
Ridge keeps all coefficients but shrinks them.
ElasticNet shows a hybrid behavior.
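One way to produce such a plot, assuming the three fitted models from the sketches above, is a grouped bar chart of coefficients per feature:

```python
# Sketch: grouped bar chart of coefficients for the three fitted models.
import numpy as np
import matplotlib.pyplot as plt

models = {"Ridge": ridge, "Lasso": lasso, "ElasticNet": enet}
x = np.arange(X.shape[1])
width = 0.25

for i, (name, model) in enumerate(models.items()):
    plt.bar(x + i * width, model.coef_, width=width, label=name)

plt.xticks(x + width, X.columns, rotation=45, ha="right")
plt.ylabel("Coefficient value")
plt.legend()
plt.tight_layout()
plt.show()
```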
Increasing the regularization strength (alpha) shrinks the coefficients.
Lasso will start dropping features (to 0) as alpha increases.
Ridge shrinks but never zeroes out coefficients.
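A sketch of the regularization-path plot described above, sweeping alpha on a log scale (the alpha range is an assumption):

```python
# Sketch: coefficient paths for Ridge and Lasso as alpha increases.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso

alphas = np.logspace(-3, 3, 50)
ridge_paths = [Ridge(alpha=a).fit(X_train_scaled, y_train).coef_ for a in alphas]
lasso_paths = [Lasso(alpha=a, max_iter=10_000).fit(X_train_scaled, y_train).coef_
               for a in alphas]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
ax1.plot(alphas, ridge_paths)
ax1.set_xscale("log")
ax1.set_title("Ridge: coefficients shrink, never reach zero")
ax2.plot(alphas, lasso_paths)
ax2.set_xscale("log")
ax2.set_title("Lasso: coefficients hit exactly zero")
for ax in (ax1, ax2):
    ax.set_xlabel("alpha")
ax1.set_ylabel("Coefficient value")
plt.tight_layout()
plt.show()
```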
Grid Search for h_price Data
alpha controls the strength of regularization.
l1_ratio controls the mix of L1 and L2 for ElasticNet.
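A sketch of the grid search, with candidate alpha and l1_ratio grids chosen for illustration (the notebook's actual grids may differ):

```python
# Sketch: grid search over alpha (and l1_ratio for ElasticNet) with 5-fold CV.
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge, Lasso, ElasticNet

param_grids = {
    "Ridge": (Ridge(), {"alpha": [0.01, 0.1, 1, 10, 100]}),
    "Lasso": (Lasso(max_iter=10_000), {"alpha": [0.001, 0.01, 0.1, 1, 10]}),
    "ElasticNet": (ElasticNet(max_iter=10_000),
                   {"alpha": [0.001, 0.01, 0.1, 1],
                    "l1_ratio": [0.2, 0.5, 0.8]}),
}

for name, (model, grid) in param_grids.items():
    search = GridSearchCV(model, grid, cv=5, scoring="r2")
    search.fit(X_train_scaled, y_train)
    print(name, "| best params:", search.best_params_,
          "| test R^2:", round(search.score(X_test_scaled, y_test), 5))
```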
Best Hyperparameters Identified (from Grid Search)
Interpretation in 5 Key Points
All Three Models Performed Exceptionally Well
Each model has a Test R² ≈ 0.98, meaning they explain ~98% of the variance in house prices on unseen test data.
This indicates a very strong fit — your features are highly predictive.
Ridge Regression Had the Slight Edge
Ridge delivered the highest Test R² (0.98192), outperforming Lasso and ElasticNet by a very small margin.
This suggests Ridge was slightly better at generalizing on the test set, possibly due to better handling of multicollinearity.
Lasso Nearly Equal but May Offer Feature Selection
Lasso scored 0.98191, almost identical to Ridge.
Even if slightly lower in R², it may be preferred if sparse models or feature elimination are desirable (e.g., fewer predictors).
ElasticNet Balanced Performance
ElasticNet, which blends Ridge and Lasso, had the lowest R² (0.98184) — but the difference is very minor.
It may still be preferred if you want both feature selection (via L1) and multicollinearity handling (via L2).
Decision Implication
Since performance differences are minimal, your choice should depend on secondary criteria:
Interpretability → Lasso
Stability with correlated features → Ridge
Balanced regularization → ElasticNet
If you just want maximum predictive performance, Ridge wins by a small margin here.