## What is Regularization? Why is it used?

Occam's razor (roughly) states that:

> Given two hypotheses that make equivalent predictions, we should use the simpler one.

Regularization encourages simpler models. Some methods penalize large parameter values (so that no single parameter becomes too important), while others drive some parameters to exactly zero (effectively removing them).

Both regularization methods presented below add a regularization term to the loss function. Because that term depends only on the model parameters, you can use either method regardless of the loss function you choose.

## L1 Regularization (Lasso)

L1 regularization can force model parameter values to exactly 0. In effect, this can eliminate some of the features/dimensions of your data. The model also becomes easier to interpret because fewer parameters remain.

L1 regularization works by adding the following term to the "original loss function":

$Loss = \text{Error Function} + \lambda \sum_{i=1}^{n} |w_i|$

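Where $w_i$ are the $n$ model parameters and $\lambda \ge 0$ is a user-defined hyperparameter that controls how strongly large parameters are penalized.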
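To make the formula concrete, here is a minimal sketch of an L1-regularized loss in plain Python. It assumes mean squared error as the error function, which is only one possible choice; the function names (`l1_penalty`, `mean_squared_error`, `l1_regularized_loss`) and the toy numbers are made up for illustration.

```python
def l1_penalty(weights, lam):
    """Return lam * sum(|w_i|), the L1 regularization term."""
    return lam * sum(abs(w) for w in weights)


def mean_squared_error(y_true, y_pred):
    """Example error function (an assumption; any loss could be used)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def l1_regularized_loss(y_true, y_pred, weights, lam):
    """Error function plus the L1 penalty on the model weights."""
    return mean_squared_error(y_true, y_pred) + l1_penalty(weights, lam)


# Toy values, chosen only to exercise the functions.
weights = [0.5, -2.0, 0.0]
loss = l1_regularized_loss([1.0, 2.0], [1.5, 1.5], weights, lam=0.1)
print(loss)
```

Note that the weight that is already 0 contributes nothing to the penalty, while the largest weight (in absolute value) contributes the most.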

## L2 Regularization (Ridge)

The main difference between L1 and L2 regularization is that L2 shrinks parameter values toward zero but rarely makes them exactly zero. This happens because it penalizes the squared parameter values (instead of their absolute values), so the penalty's pull weakens as a parameter gets close to zero. Here is the definition:

$Loss = \text{Error Function} + \lambda \sum_{i=1}^{n} w_i^{2}$

When using Ridge regularization, the final model will generally keep all of its parameters, which makes it harder to interpret than an L1-regularized model.
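The difference between the two penalties is easiest to see in a deliberately tiny setting. The sketch below assumes a model with a single weight $w$ and no intercept, a sum-of-squared-errors error function, and made-up toy data. Under those assumptions both minimizers have a closed form: the L2 solution divides by a larger number as $\lambda$ grows, while the L1 solution subtracts $\lambda/2$ and is clipped at zero. The function names are hypothetical.

```python
def ridge_weight(xs, ys, lam):
    """Minimizer of sum((y - w*x)^2) + lam * w^2 for a single weight w."""
    s_xy = sum(x * y for x, y in zip(xs, ys))
    s_xx = sum(x * x for x in xs)
    return s_xy / (s_xx + lam)


def lasso_weight(xs, ys, lam):
    """Minimizer of sum((y - w*x)^2) + lam * |w| for a single weight w."""
    s_xy = sum(x * y for x, y in zip(xs, ys))
    s_xx = sum(x * x for x in xs)
    # Soft-thresholding: shrink |s_xy| by lam / 2 and clip at zero.
    shrunk = max(abs(s_xy) - lam / 2.0, 0.0)
    sign = 1.0 if s_xy >= 0 else -1.0
    return sign * shrunk / s_xx


# Toy data with a weak relationship between x and y (illustrative only).
xs = [1.0, 2.0, 3.0]
ys = [0.2, 0.1, 0.3]

for lam in (0.0, 1.0, 5.0):
    print(lam, ridge_weight(xs, ys, lam), lasso_weight(xs, ys, lam))
```

With no penalty the two weights agree. As $\lambda$ grows, the Ridge weight keeps shrinking but stays non-zero, while the Lasso weight becomes exactly zero once $\lambda / 2$ exceeds $|\sum_i x_i y_i|$.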

## Choosing $\lambda$

The right value for $\lambda$ depends on the model and the training data you have, so it is usually tuned by comparing several candidate values on held-out data.

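- A high $\lambda$ makes your model simple, but you can underfit the data: the model makes poor predictions on both the training data and new data.
- A low $\lambda$ lets your model stay complex, and you can overfit the data: the model fits the training data well but makes poor predictions on new data.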
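One common way to navigate this trade-off is to try several candidate values and keep the one with the lowest error on a held-out validation set. The sketch below illustrates that idea under simplifying assumptions: a one-weight Ridge model with no intercept, mean squared error as the validation metric, and toy data invented for this example. The helper names and the candidate grid are hypothetical, not a recommended default.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form Ridge fit for a one-weight model without an intercept."""
    s_xy = sum(x * y for x, y in zip(xs, ys))
    s_xx = sum(x * x for x in xs)
    return s_xy / (s_xx + lam)


def mse(w, xs, ys):
    """Mean squared error of the predictions w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def choose_lambda(train, val, candidates):
    """Return (best_lam, best_err) by validation error over the candidates."""
    best_lam, best_err = None, float("inf")
    for lam in candidates:
        w = fit_ridge_1d(train[0], train[1], lam)  # fit on training data only
        err = mse(w, val[0], val[1])               # score on held-out data
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err


# Toy data (made up): the training targets are noisy and overshoot the trend
# seen in the validation set.
train = ([1.0, 2.0, 3.0, 4.0], [2.5, 4.4, 6.3, 8.6])
val = ([5.0, 6.0], [10.1, 11.9])
candidates = [0.0, 0.1, 1.0, 3.0, 10.0, 100.0]

best_lam, best_err = choose_lambda(train, val, candidates)
print(best_lam, best_err)
```

On this toy data, both the smallest and the largest candidates give a higher validation error than an intermediate value, which mirrors the overfitting and underfitting cases in the list above. With real data, cross-validation over several splits is generally more reliable than a single hold-out set.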