
Whenever we discuss prediction models, it’s important to understand the two main sources of prediction error: bias and variance. A proper understanding of these concepts helps us not only to build accurate models but also to avoid over-fitting and under-fitting.

We quickly explain the two concepts using the following illustration.

Suppose that a man is trying to hit the bull’s eye. **His shooting skill can be considered the prediction model.** The shooting results are the model’s predictions.

## What is bias?

- Bias is the difference between the **average prediction and the correct value**.
- If the shots land, on average, far from the bull’s eye, the bias is high; if they are centred on it, the bias is low.

#### Some causes of high bias:

- An oversimplified model
- Key features left out of the model
- Not enough data
- Wrong model selection

## What is variance?

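- Variance shows the spread of the model’s predictions, not of the data itself.
- More precisely, it is the variability of the model’s prediction for a given data point when the model is trained on different training sets. In the shooting illustration, it is how widely the shots scatter around their own average.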
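
To make the two definitions concrete, here is a minimal sketch of the shooting illustration using only the Python standard library. The function names and the `aim_offset` and `scatter` parameters are our own assumptions for illustration: `aim_offset` stands for a systematic error of the shooter and `scatter` for the random error around the aim point.

```python
import random
import statistics

def simulate_shots(aim_offset, scatter, n_shots=1000, seed=0):
    """Simulate shots at a bull's eye located at (0, 0).

    aim_offset: systematic error of the shooter (hypothetical parameter).
    scatter: standard deviation of the random error around the aim point.
    """
    rng = random.Random(seed)
    return [
        (rng.gauss(aim_offset[0], scatter), rng.gauss(aim_offset[1], scatter))
        for _ in range(n_shots)
    ]

def bias_and_variance(shots, target=(0.0, 0.0)):
    mean_x = statistics.fmean([x for x, _ in shots])
    mean_y = statistics.fmean([y for _, y in shots])
    # Bias: distance between the average shot and the bull's eye.
    bias = ((mean_x - target[0]) ** 2 + (mean_y - target[1]) ** 2) ** 0.5
    # Variance: average squared distance of the shots from their own average.
    variance = statistics.fmean(
        [(x - mean_x) ** 2 + (y - mean_y) ** 2 for x, y in shots]
    )
    return bias, variance

# A steady shooter who aims at the wrong spot: high bias, low variance.
steady_but_off = simulate_shots(aim_offset=(2.0, 1.0), scatter=0.2)
# A shooter who aims correctly but shakes a lot: low bias, high variance.
centred_but_shaky = simulate_shots(aim_offset=(0.0, 0.0), scatter=2.0)

for name, shots in [("steady but off", steady_but_off),
                    ("centred but shaky", centred_but_shaky)]:
    bias, variance = bias_and_variance(shots)
    print(f"{name}: bias={bias:.2f}, variance={variance:.2f}")
```

The first shooter corresponds to a model that is consistently wrong in the same way, the second to a model that is right on average but whose individual predictions are unreliable.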

#### Some causes of high variance:

- Noisy training dataset
- Sparse dataset
- An algorithm that fails to generalize because it fits noise instead of the underlying patterns

## Overfitting and Underfitting

**Under-fitting**: often high bias + low variance

**Over-fitting**: often low bias + high variance; performs well on the training dataset but poorly on the testing dataset

1. Under-fit (high bias): more training data alone is unlikely to help, so don’t spend time collecting more data before changing the model.

2. Over-fit (high variance): getting more training data is likely to help.
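
The two rules of thumb above can be checked with a small experiment. The sketch below is only an illustration under assumptions of our own: the data come from a sine curve with Gaussian noise, a degree-1 polynomial plays the under-fit model, and a degree-9 polynomial plays the over-fit model. It uses NumPy's `polyfit` and `polyval`.

```python
import numpy as np

def true_fn(x):
    # Assumed ground truth for this illustration.
    return np.sin(np.pi * x)

def sample(n, rng, noise=0.3):
    x = rng.uniform(-1.0, 1.0, n)
    return x, true_fn(x) + rng.normal(0.0, noise, n)

def mean_test_mse(degree, n_train, n_trials=50, seed=0):
    """Average test MSE of a polynomial of the given degree fit on n_train points."""
    rng = np.random.default_rng(seed)
    x_test, y_test = sample(2000, rng)  # same test set for every call with this seed
    errors = []
    for _ in range(n_trials):
        x_train, y_train = sample(n_train, rng)
        coeffs = np.polyfit(x_train, y_train, degree)
        errors.append(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return float(np.mean(errors))

train_sizes = [15, 50, 200, 1000]
underfit = [mean_test_mse(1, n) for n in train_sizes]  # straight line: high bias
overfit = [mean_test_mse(9, n) for n in train_sizes]   # degree 9: high variance

for n, u, o in zip(train_sizes, underfit, overfit):
    print(f"n_train={n:5d}  degree 1 test MSE={u:.3f}  degree 9 test MSE={o:.3f}")
```

In this setup the straight line's test error should stay roughly flat as the training set grows, because its error is dominated by bias, while the degree-9 model's test error should fall sharply as more data reduces its variance.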

Choosing a reasonable number of features, polynomial degree, and regularization parameter (lambda) is the key to balancing over-fitting and under-fitting.
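
As a rough illustration of how lambda moves a model between the two extremes, the sketch below fits a degree-9 polynomial with ridge (L2) regularization using the closed-form solution. The sine-plus-noise data, the candidate lambda values, and the helper names are assumptions made for this example; for simplicity the intercept is penalized too, which a production implementation would usually avoid.

```python
import numpy as np

def design_matrix(x, degree):
    # Columns are 1, x, x^2, ..., x^degree.
    return np.vander(x, degree + 1, increasing=True)

def fit_ridge(x, y, degree, lam):
    X = design_matrix(x, degree)
    # Closed-form ridge solution: (X^T X + lam * I) w = X^T y.
    # Simplification: the intercept is penalized as well.
    return np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)

def mse(w, x, y):
    return float(np.mean((design_matrix(x, len(w) - 1) @ w - y) ** 2))

rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, 20)
y_train = np.sin(np.pi * x_train) + rng.normal(0.0, 0.3, 20)
x_val = rng.uniform(-1.0, 1.0, 200)
y_val = np.sin(np.pi * x_val) + rng.normal(0.0, 0.3, 200)

lambdas = [1e-6, 1e-3, 1e-1, 10.0, 1000.0]
train_errors, val_errors = [], []
for lam in lambdas:
    w = fit_ridge(x_train, y_train, degree=9, lam=lam)
    train_errors.append(mse(w, x_train, y_train))
    val_errors.append(mse(w, x_val, y_val))
    print(f"lambda={lam:g}  train MSE={train_errors[-1]:.3f}  "
          f"validation MSE={val_errors[-1]:.3f}")
```

The training error can only grow as lambda increases, since the penalty pulls the weights away from the best fit to the training points. The validation error is the number to watch: a very large lambda flattens the model towards zero (under-fit), and the most useful value typically lies somewhere in between.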

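Splitting the data into a training set (60%), a cross-validation set (20%), and a test set (20%) helps in choosing the best polynomial degree and regularization parameter: fit each candidate model on the training set, compare the candidates on the cross-validation set, and report the final error on the test set.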
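
A minimal sketch of this workflow, choosing only the polynomial degree, could look as follows. The synthetic sine-plus-noise data and the range of candidate degrees are assumptions for illustration; the regularization parameter could be selected in the same way by looping over candidate lambda values instead of degrees.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1.0, 1.0, n)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.3, n)  # assumed toy data

# Shuffle the indices once, then cut them into 60% / 20% / 20%.
idx = rng.permutation(n)
n_train = n * 60 // 100
n_val = n * 20 // 100
train, val, test = np.split(idx, [n_train, n_train + n_val])

def mse(coeffs, subset):
    return float(np.mean((np.polyval(coeffs, x[subset]) - y[subset]) ** 2))

# Fit every candidate degree on the training set only,
# and score it on the cross-validation set.
val_errors = {}
for degree in range(1, 11):
    coeffs = np.polyfit(x[train], y[train], degree)
    val_errors[degree] = mse(coeffs, val)

best_degree = min(val_errors, key=val_errors.get)

# The test set is touched once, after the choice has been made.
best_coeffs = np.polyfit(x[train], y[train], best_degree)
test_error = mse(best_coeffs, test)
print(f"best degree: {best_degree}, test MSE: {test_error:.3f}")
```

Keeping the test set out of the selection loop is the point of the three-way split: the cross-validation error of the chosen model is optimistically biased because the model was picked to minimize it, whereas the test error is not.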