
Machines learn by means of a loss function, which reflects how well a specific model performs on the given data. If predictions deviate too much from the actual results, the loss function yields a very large value. Gradually, with the help of an optimization function, the model's parameters are modified to reduce the prediction error. In this article, we will quickly review some common loss functions and their usage in the domain of machine/deep learning.

Unfortunately, there's no one-size-fits-all loss function for machine learning algorithms. Choosing a loss function for a specific problem involves various factors, such as the type of machine learning algorithm chosen, the ease of calculating the derivatives and, to some degree, the percentage of outliers in the data set.

Since there are two common kinds of problems, classification and regression, loss functions can likewise be sorted into two major categories: **Classification losses** and **Regression losses**.

> **Note:**
>
> - `n`: number of training examples
> - `i`: the ith training example in the data set
> - `y(i)`: ground-truth label for the ith training example
> - `y_hat(i)`: prediction for the ith training example

## Classification Losses

In classification, we are trying to predict the output from a set of finite categorical values, e.g. given a large data set of images of handwritten digits, categorizing them into one of the digits 0–9.

### Zero-one loss

In statistics and decision theory, a frequently used loss function is the *0-1 loss function*

$$L(\hat{y}, y) = I(\hat{y} \neq y)$$

where *I* is the indicator function. The function is non-continuous and thus impractical to optimize.

```python
from sklearn.metrics import zero_one_loss

y_pred = [1, 2, 3, 4]
y_true = [2, 2, 9, 4]

zero_one_loss(y_true, y_pred)  # 0.5, the fraction of misclassified examples
L = zero_one_loss(y_true, y_pred, normalize=False)
# L = 2, since predictions differ from the truth in two places
```

### Hinge Loss/Multi-class SVM Loss

In simple terms, the score of the correct category should be greater than the score of every incorrect category by some safety margin (usually one), and each incorrect category that violates this margin contributes to the loss. Hence hinge loss is used for maximum-margin classification, most notably for support vector machines. Although not differentiable everywhere, it's a convex function, which makes it easy to work with the usual convex optimizers used in the machine learning domain.

**Mathematical formulation**:

$$SVMLoss = \sum\limits_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$$

Consider an example where we have three training examples and three classes to predict: Dog, Cat and Horse. Below are the scores predicted by our algorithm for each of the classes:

|       | Img#1 | Img#2 | Img#3 |
|-------|-------|-------|-------|
| Dog   | -0.39 | -4.61 | 1.03  |
| Cat   | 1.49  | 3.28  | -2.37 |
| Horse | 4.21  | 1.46  | -2.27 |

Computing hinge losses for all three training examples (the ground-truth classes are Dog, Cat and Horse, respectively):

```python
## 1st training example
max(0, (1.49) - (-0.39) + 1) + max(0, (4.21) - (-0.39) + 1)
# = max(0, 2.88) + max(0, 5.6)
# = 2.88 + 5.6
# = 8.48 (high loss as very wrong prediction)

## 2nd training example
max(0, (-4.61) - (3.28) + 1) + max(0, (1.46) - (3.28) + 1)
# = max(0, -6.89) + max(0, -0.82)
# = 0 + 0
# = 0 (zero loss as correct prediction)

## 3rd training example
max(0, (1.03) - (-2.27) + 1) + max(0, (-2.37) - (-2.27) + 1)
# = max(0, 4.3) + max(0, 0.9)
# = 4.3 + 0.9
# = 5.2 (high loss as very wrong prediction)
```
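The per-example arithmetic above can also be written as one vectorized NumPy sketch; the function and variable names here are my own, not from any library:

```python
import numpy as np

# scores from the table: rows are Dog, Cat, Horse; columns are Img#1..Img#3
scores = np.array([[-0.39, -4.61,  1.03],
                   [ 1.49,  3.28, -2.37],
                   [ 4.21,  1.46, -2.27]])
y = np.array([0, 1, 2])  # ground-truth class index per image: Dog, Cat, Horse

def multiclass_hinge(scores, y, margin=1.0):
    cols = np.arange(scores.shape[1])
    correct = scores[y, cols]                         # score of the true class per image
    margins = np.maximum(0, scores - correct + margin)
    margins[y, cols] = 0                              # the true class itself contributes nothing
    return margins.sum(axis=0)                        # one hinge loss per image

print(multiclass_hinge(scores, y))  # per-image losses: approximately [8.48, 0, 5.2]
```

This reproduces the three hand-computed losses in a single pass, which is how such losses are usually implemented in practice.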

### Cross-Entropy Loss/Negative Log-Likelihood

This is the most common setting for classification problems. Cross-entropy loss increases as the predicted probability diverges from the actual label.

**Mathematical formulation**:

$$CrossEntropyLoss = -(y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i))$$

Notice that when the actual label is 1 (\(y_i = 1\)), the second half of the function disappears, whereas when the actual label is 0 (\(y_i = 0\)) the first half is dropped. In short, we are just taking the negative log of the predicted probability for the ground-truth class. An important consequence is that cross-entropy loss heavily penalizes predictions that are *confident but wrong*.

```python
import numpy as np

predictions = np.array([[0.25, 0.25, 0.25, 0.25],
                        [0.01, 0.01, 0.01, 0.96]])
targets = np.array([[0, 0, 0, 1],
                    [0, 0, 0, 1]])

def cross_entropy(predictions, targets, epsilon=1e-10):
    # clip predictions so log(0) never occurs
    predictions = np.clip(predictions, epsilon, 1. - epsilon)
    N = predictions.shape[0]
    ce_loss = -np.sum(targets * np.log(predictions)) / N
    return ce_loss

cross_entropy_loss = cross_entropy(predictions, targets)
print("Cross entropy loss is: " + str(cross_entropy_loss))
# Cross entropy loss is approximately 0.7136
```
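The formula above is the binary case, while the snippet computes the multi-class version. Here is a minimal binary sketch matching the formula term by term; the function name and sample values are my own:

```python
import numpy as np

def binary_cross_entropy(y, y_hat, epsilon=1e-12):
    # clip predicted probabilities so log(0) never occurs
    y_hat = np.clip(y_hat, epsilon, 1.0 - epsilon)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1])            # ground-truth labels
y_hat = np.array([0.9, 0.1, 0.8])  # predicted probabilities

print(binary_cross_entropy(y, y_hat))         # small loss: predictions mostly right
print(binary_cross_entropy(np.array([1.0]),
                           np.array([0.01]))) # confident but wrong: large loss
```

The second call illustrates the penalization of confident-but-wrong predictions mentioned above: a probability of 0.01 for a true label of 1 yields a loss of about 4.6, over thirty times the first.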

## Regression Losses

Regression, on the other hand, deals with predicting a continuous value: for example, given the floor area and the number of rooms, predict the price of a house, which can be any positive real number.

### Mean Square Error/Quadratic Loss/L2 Loss

**Mathematical formulation**:

$$MSE = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}$$

As the name suggests, *mean square error* is measured as the average of the squared differences between predictions and actual observations. It's only concerned with the average magnitude of the error, irrespective of direction. However, due to squaring, predictions which are far away from the actual values are penalized much more heavily than less deviant ones. MSE also has nice mathematical properties, which make it easier to calculate gradients.

```python
import numpy as np

y_hat = np.array([0.000, 0.166, 0.333])
y_true = np.array([0.000, 0.254, 0.998])

def rmse(predictions, targets):
    differences = predictions - targets
    differences_squared = differences ** 2
    mean_of_differences_squared = differences_squared.mean()
    rmse_val = np.sqrt(mean_of_differences_squared)  # square root of the MSE
    return rmse_val

print("d is: " + str(["%.8f" % elem for elem in y_hat]))
print("p is: " + str(["%.8f" % elem for elem in y_true]))
rmse_val = rmse(y_hat, y_true)
print("rms error is: " + str(rmse_val))
```

```
d is: ['0.00000000', '0.16600000', '0.33300000']
p is: ['0.00000000', '0.25400000', '0.99800000']
rms error is: 0.3872849941150143
```
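Strictly speaking, the `rmse` function above reports the root mean square error, i.e. the square root of MSE. MSE itself, exactly as in the formula, is a one-liner (reusing the same arrays):

```python
import numpy as np

y_hat = np.array([0.000, 0.166, 0.333])
y_true = np.array([0.000, 0.254, 0.998])

mse = np.mean((y_true - y_hat) ** 2)  # average of squared differences
print("mse is: " + str(mse))          # its square root is the rms error above
```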

### Mean Absolute Error/L1 Loss

*Mean absolute error*, on the other hand, is measured as the average of the absolute differences between predictions and actual observations. Like MSE, it measures the magnitude of the error without considering direction. Unlike MSE, however, MAE needs more complicated tools, such as linear programming, to compute the gradients, since the absolute value is not differentiable at zero. MAE is also more robust to outliers, since it does not square the errors.

**Mathematical formulation**:

$$MAE = \frac{\sum_{i=1}^{n} |y_i -\hat{y}_i|}{n}$$

```python
import numpy as np

y_hat = np.array([0.000, 0.166, 0.333])
y_true = np.array([0.000, 0.254, 0.998])

print("d is: " + str(["%.8f" % elem for elem in y_hat]))
print("p is: " + str(["%.8f" % elem for elem in y_true]))

def mae(predictions, targets):
    differences = predictions - targets
    absolute_differences = np.absolute(differences)
    mean_absolute_differences = absolute_differences.mean()
    return mean_absolute_differences

mae_val = mae(y_hat, y_true)
print("mae error is: " + str(mae_val))
```

```
d is: ['0.00000000', '0.16600000', '0.33300000']
p is: ['0.00000000', '0.25400000', '0.99800000']
mae error is: 0.251
```

### Mean Bias Error

This is much less common in the machine learning domain than its counterparts. It is similar to MAE, with the only difference that we don't take the absolute value. Clearly there's a need for caution, as positive and negative errors can cancel each other out. Although less accurate in practice, it can determine whether the model has a positive or negative bias.

**Mathematical formulation**:

$$MBE = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)}{n}$$
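A minimal sketch of MBE (the function name is my own), reusing the arrays from the earlier regression examples:

```python
import numpy as np

y_hat = np.array([0.000, 0.166, 0.333])
y_true = np.array([0.000, 0.254, 0.998])

def mbe(predictions, targets):
    # signed differences: positive and negative errors can cancel out
    return np.mean(targets - predictions)

print("mbe is: " + str(mbe(y_hat, y_true)))
# a positive value means the model under-predicts on average
```

Since every prediction here falls below its target, no cancellation occurs and the MBE equals the MAE from the previous section; with errors of mixed sign, the MBE would be smaller in magnitude.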

## Wrapping up

As noted at the start, choosing a loss function for a specific problem depends on several factors: the type of machine learning algorithm chosen, the ease of calculating the derivatives and, to some degree, the percentage of outliers in the data set. Nevertheless, you should at least know which loss functions are suitable for a particular problem.