A.I, Data and Software Engineering

Lasso vs Ridge vs Elastic Net – Machine learning


Lasso, Ridge, and Elastic Net are excellent methods to improve the performance of your linear model. This post will summarise the usage of these regularization techniques.

Bias:

Bias is the set of simplifying assumptions a model makes about the target function. Some bias helps the model generalize and makes it less sensitive to individual data points. It also shortens training time, because the target function is less complex. High bias means the model makes strong assumptions about the target function, which can lead to underfitting.
Examples of high-bias algorithms include Linear Regression and Logistic Regression.

Variance:

In machine learning, variance is the error caused by a model’s sensitivity to small fluctuations in the training data. High variance causes an algorithm to model the outliers and noise in the training set, which is commonly referred to as overfitting. In this situation, the model essentially memorizes the training points and predicts poorly on new data.
Examples of high-variance algorithms include Decision Trees and KNN.
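To make underfitting and overfitting concrete, here is a small sketch using polynomial fits of increasing degree. The sine-shaped target, the noise level, the sample sizes, and the chosen degrees are all assumptions made for illustration; they do not come from any particular dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_function(x):
    # Hypothetical ground truth, used only for this illustration.
    return np.sin(2 * np.pi * x)

def mse(y, y_hat):
    # Mean squared error between targets and predictions.
    return float(np.mean((y - y_hat) ** 2))

# Small noisy training set and a larger test set from the same process.
x_train = rng.uniform(0.0, 1.0, 20)
y_train = true_function(x_train) + rng.normal(0.0, 0.2, x_train.size)
x_test = rng.uniform(0.0, 1.0, 200)
y_test = true_function(x_test) + rng.normal(0.0, 0.2, x_test.size)

errors = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit of the given degree.
    coefs = np.polyfit(x_train, y_train, degree)
    train_err = mse(y_train, np.polyval(coefs, x_train))
    test_err = mse(y_test, np.polyval(coefs, x_test))
    errors[degree] = (train_err, test_err)
    print(f"degree={degree}: train MSE={train_err:.4f}, test MSE={test_err:.4f}")
```

The training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. The test error is the number to watch: a rigid model (degree 1) typically misses the shape of the target, while a very flexible one may start chasing the noise.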

Overfitting vs Underfitting vs Just Right

Error in Linear Regression :

Let’s consider a simple regression model that predicts a variable Y from a linear combination of the variables X plus a normally distributed error term \epsilon:

Y = X\beta + \epsilon

where \epsilon is normally distributed noise with mean zero and variance \sigma^{2}.

Here \beta is the vector of coefficients for the variables in X, which we need to estimate from the training data. We choose the estimate \hat{\beta} that produces the lowest residual error, defined as:

L_{ols}(\hat{\beta}) = \sum_{i=1}^n \left( y_i - x_i \hat{\beta} \right)^2 = ||Y - X\hat{\beta}||^2

To calculate \hat{\beta} we use the closed-form solution of the normal equations:
\hat{\beta}_{ols} = \left ( X^{T}X \right )^{-1} X^{T}Y
Here the bias and variance of \hat{\beta} can be defined as:
Bias\left ( \hat{\beta} \right ) = E\left ( \hat{\beta} \right ) - \beta
and
Variance\left ( \hat{\beta} \right ) = \sigma^{2}\left ( X^{T}X \right )^{-1}
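The closed-form OLS estimate above can be sketched in a few lines of NumPy. The synthetic data, the true coefficients, and the noise level are assumptions chosen for illustration. The sketch solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, which is usually the more numerically stable choice.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: n observations, p features (illustrative values).
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
sigma = 0.1
Y = X @ beta_true + rng.normal(0.0, sigma, n)

# OLS estimate from the normal equations: (X^T X) beta = X^T Y.
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Residual sum of squares, i.e. the OLS loss at the estimate.
rss = float(np.sum((Y - X @ beta_ols) ** 2))

print("estimated coefficients:", beta_ols)
print("residual sum of squares:", rss)
```

With low noise and many more observations than features, the estimate should land close to `beta_true`.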
We can decompose the expected prediction error of the OLS model in terms of bias and variance as follows:

Error-term = \left ( E\left ( X\hat{\beta} \right ) - X\beta \right )^{2} +E\left ( X\hat{\beta} - E\left ( X\hat{\beta} \right ) \right )^{2}+\sigma^{2}

The first term of the above equation represents Bias^{2}, the second term represents the variance, and the third term (\sigma^{2}) is the irreducible error.
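The bias and variance formulas for the OLS estimate can be checked numerically. The sketch below keeps X fixed, redraws the noise many times, and compares the empirical bias and covariance of the estimates with the theoretical values. The sample size, coefficients, noise level, and number of trials are assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed design matrix and true coefficients (illustrative values).
n, p, sigma = 50, 2, 1.0
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0])
XtX_inv = np.linalg.inv(X.T @ X)

# Redraw the noise many times; each row of Y is one simulated dataset.
n_trials = 5000
noise = rng.normal(0.0, sigma, size=(n_trials, n))
Y = X @ beta_true + noise  # shape (n_trials, n)

# OLS estimate for every dataset at once: each row is (X^T X)^{-1} X^T y.
estimates = Y @ X @ XtX_inv  # shape (n_trials, p)

# Empirical bias: average estimate minus the true coefficients.
empirical_bias = estimates.mean(axis=0) - beta_true

# Empirical covariance versus the theoretical sigma^2 (X^T X)^{-1}.
empirical_cov = np.cov(estimates, rowvar=False)
theoretical_cov = sigma ** 2 * XtX_inv

print("empirical bias:", empirical_bias)
print("empirical covariance:\n", empirical_cov)
print("theoretical covariance:\n", theoretical_cov)
```

Under these assumptions the empirical bias should be close to zero and the two covariance matrices should nearly agree, which is the sense in which OLS has no bias but can have high variance when X^{T}X is close to singular.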

Bias vs Variance Tradeoff

[Figure: bias-variance visualization on a bull’s-eye target. Source: Understanding Bias-Variance Tradeoff, by Meet Patel, Medium]

Consider a very accurate model: its predictions have low error and land close to the target (represented by the bull’s eye). This model has low bias and low variance. If the predictions are scattered widely, that is a sign of high variance; if they are consistently far from the target, that is a sign of high bias.
Sometimes we need to choose between low variance and low bias. Regularization is an approach that accepts some bias in exchange for lower variance. It works well for many classification and regression problems.

Ridge Regression :

In Ridge regression, we add a penalty term equal to the sum of the squared coefficients, also called the L2 penalty. A coefficient \lambda controls the strength of that penalty. If \lambda is zero, the equation reduces to basic OLS; if \lambda > 0, it adds a constraint on the coefficients. As we increase \lambda, this constraint pushes the coefficients towards zero. This lowers variance (the prediction depends less on any single variable) at the cost of some added bias.

\hat{\beta}_{ridge} = argmin_{\beta}\left ({\left \| Y - X\beta \right \|}^{2} + \lambda {\left \| \beta \right \|}_{2}^{2} \right )

where \lambda is the regularization penalty.

Limitation of Ridge Regression: Ridge regression decreases the complexity of a model but does not reduce the number of variables, since it only shrinks coefficients and never sets them exactly to zero. Hence, this model is not good for feature reduction.
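The ridge objective has a standard closed-form minimizer, \left ( X^{T}X + \lambda I \right )^{-1} X^{T}Y, which makes the shrinkage easy to observe. The sketch below is a minimal illustration on synthetic data; the function name `ridge_fit`, the data, and the grid of \lambda values are assumptions, and no intercept is fitted.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    # Closed-form ridge estimate: solve (X^T X + lam * I) beta = X^T Y.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

rng = np.random.default_rng(0)

# Synthetic data in which two of the five features are irrelevant.
X = rng.normal(size=(100, 5))
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
Y = X @ beta_true + rng.normal(0.0, 0.5, 100)

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
coefs = np.array([ridge_fit(X, Y, lam) for lam in lambdas])
norms = np.linalg.norm(coefs, axis=1)

for lam, beta, norm in zip(lambdas, coefs, norms):
    print(f"lambda={lam:>7}: beta={np.round(beta, 3)}, L2 norm={norm:.3f}")
```

As \lambda grows the coefficient vector shrinks towards zero, yet even the coefficients of the irrelevant features stay non-zero, which is the limitation described above.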

Lasso Regression :


Lasso stands for Least Absolute Shrinkage and Selection Operator. It adds a penalty term to the cost function equal to the sum of the absolute values of the coefficients (the L1 penalty). As coefficients move away from 0, this term grows, so the model shrinks the coefficients in order to reduce the loss. The difference between ridge and lasso regression is that lasso can set coefficients to exactly zero, whereas ridge never does.

\hat{\beta}_{lasso} = argmin_{\beta}\left ({\left \| Y - X\beta \right \|}^{2} + \lambda {\left \| \beta \right \|}_{1} \right )
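Unlike ridge, the lasso objective has no closed-form solution in general. One common way to minimize it is cyclic coordinate descent with soft-thresholding, sketched below for the objective exactly as written above (squared error plus \lambda times the L1 norm, no intercept). The function names, the synthetic data, the value of \lambda, and the fixed iteration count are assumptions for illustration; this is not a production solver.

```python
import numpy as np

def soft_threshold(rho, t):
    # Shrink rho towards zero by t; values within [-t, t] become exactly 0.
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, Y, lam, n_iter=200):
    # Cyclic coordinate descent for ||Y - X beta||^2 + lam * ||beta||_1.
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)  # x_j^T x_j for every column
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back.
            residual = Y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ residual
            # The threshold is lam / 2 because the squared error has no 1/2 factor.
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

rng = np.random.default_rng(0)

# Synthetic data in which features 1 and 3 have no effect on Y.
X = rng.normal(size=(200, 5))
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
Y = X @ beta_true + rng.normal(0.0, 0.5, 200)

beta_unpenalized = lasso_cd(X, Y, lam=0.0)
beta_lasso = lasso_cd(X, Y, lam=100.0)

print("lam=0   :", np.round(beta_unpenalized, 3))
print("lam=100 :", np.round(beta_lasso, 3))
```

With \lambda = 0 the procedure converges to the OLS solution. With a sufficiently large \lambda the coefficients of the irrelevant features are set to exactly zero while the remaining ones are shrunk, which is the feature-selection behaviour that ridge lacks.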

Limitation of Lasso Regression:

  • Lasso sometimes struggles with some types of data. If the number of predictors (p) is greater than the number of observations (n), Lasso will pick at most n predictors as non-zero, even if all predictors are relevant (or may matter at test time).
  • If there are two or more highly collinear variables, Lasso tends to select one of them more or less arbitrarily, which is not good for the interpretation of data.
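The collinearity limitation can be seen in an extreme case: two identical feature columns. The sketch below compares a simple coordinate-descent lasso with closed-form ridge on such data. Everything here (function names, data, \lambda = 10) is an assumption for illustration, and in this particular solver the surviving feature is decided by the update order rather than by chance.

```python
import numpy as np

def soft_threshold(rho, t):
    # Shrink rho towards zero by t; values within [-t, t] become exactly 0.
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, Y, lam, n_iter=200):
    # Cyclic coordinate descent for ||Y - X beta||^2 + lam * ||beta||_1.
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            residual = Y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ residual
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

def ridge_fit(X, Y, lam):
    # Closed-form ridge estimate: solve (X^T X + lam * I) beta = X^T Y.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

rng = np.random.default_rng(1)

# Two perfectly collinear features: the second column copies the first.
x = rng.normal(size=100)
X = np.column_stack([x, x])
Y = 2.0 * x + rng.normal(0.0, 0.1, 100)

beta_lasso = lasso_cd(X, Y, lam=10.0)
beta_ridge = ridge_fit(X, Y, lam=10.0)

print("lasso:", beta_lasso)
print("ridge:", beta_ridge)
```

Lasso puts essentially all of the weight on one copy and leaves the other at (numerically) zero, even though the two features carry the same information. Ridge splits the weight evenly between them.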

Elastic Net

Elastic Net combines characteristics of both lasso and ridge. Elastic Net reduces the impact of different features while not eliminating all of the features.

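The formula, as you can see below, is the squared error plus the sum of the ridge and lasso penalties, each with its own weight.

Elastic Net Formula: Ridge + Lasso

L_{enet} = \sum_{i=1}^n \left( \hat{y}_i - y_i \right)^{2} + \lambda_{2} \sum_{j} \beta_{j}^{2} + \lambda_{1} \sum_{j} \left| \beta_{j} \right|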
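The combined penalty can be minimized with the same coordinate-descent idea used for lasso: the L1 part contributes a soft-threshold and the L2 part enlarges the denominator. The sketch below uses separate weights `lam1` (L1) and `lam2` (L2), which is an assumption of this example; setting them equal recovers a single shared \lambda. The function names, the data, and the penalty values are also illustrative.

```python
import numpy as np

def soft_threshold(rho, t):
    # Shrink rho towards zero by t; values within [-t, t] become exactly 0.
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def elastic_net_cd(X, Y, lam1, lam2, n_iter=500):
    # Coordinate descent for
    # ||Y - X beta||^2 + lam2 * ||beta||_2^2 + lam1 * ||beta||_1.
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            residual = Y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ residual
            # L1 part: soft-threshold at lam1 / 2. L2 part: extra lam2 below.
            beta[j] = soft_threshold(rho, lam1 / 2.0) / (col_sq[j] + lam2)
    return beta

rng = np.random.default_rng(0)

X = rng.normal(size=(200, 5))
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
Y = X @ beta_true + rng.normal(0.0, 0.5, 200)

beta_l1_only = elastic_net_cd(X, Y, lam1=100.0, lam2=0.0)   # behaves like lasso
beta_l2_only = elastic_net_cd(X, Y, lam1=0.0, lam2=20.0)    # behaves like ridge
beta_mix = elastic_net_cd(X, Y, lam1=100.0, lam2=20.0)      # both penalties

# Closed-form ridge solution, for comparison with the lam1 = 0 case.
ridge_closed = np.linalg.solve(X.T @ X + 20.0 * np.eye(5), X.T @ Y)

print("L1 only:", np.round(beta_l1_only, 3))
print("L2 only:", np.round(beta_l2_only, 3))
print("mixed  :", np.round(beta_mix, 3))
```

With `lam1 = 0` the result matches ridge, with `lam2 = 0` it matches lasso, and with both active the solution is sparse like lasso while being shrunk further by the ridge term.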

To conclude, Lasso, Ridge, and Elastic Net are excellent methods to improve the performance of your linear model. The same ideas apply when you are training a neural network, which can be viewed as a collection of linear models. Lasso can eliminate many features and reduce overfitting in your linear model. Ridge reduces the impact of features that are not important in predicting your y values. Elastic Net combines feature elimination from Lasso and coefficient shrinkage from Ridge to improve your model’s predictions.
