
Quickly grasp the concept of one-hot encoding with simple data and code.

## Categorical vs. numerical data

Categorical data are variables that contain label values. For example, a "*colour*" variable can take the values "**red**", "**green**", and "**blue**". Here, "red", "green", and "blue" are labels represented by strings.

Numerical data are variables that contain numeric values. For example, an "*age*" variable can take the values **10**, **25**, and **36**.

## Why does categorical data matter?

Some algorithms, such as decision trees, can work with categorical data directly. Many other algorithms, however, only operate properly on numerical data.

The question is: **can we simply map labels to numbers, like "red" -> 1, "green" -> 2, and so on?**

The answer is **“Yes and … No”**.

Why no? After such a mapping, $1 + 1 = 2$ holds for the numerical codes, but it does not make sense to say **"red + red = green"**. In other words, the mapping imposes an ordering and arithmetic on the labels that the original data never had. One-hot encoding was introduced to remove such artificial relations between labels.
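As a quick illustration, here is a minimal sketch of such a naive mapping in plain Python (the dictionary and variable names are made up for this example); the arithmetic at the end is valid on the codes but meaningless for the colours they stand for:

```python
# a naive label-to-integer mapping
colour_to_code = {'red': 1, 'green': 2, 'blue': 3}

colours = ['red', 'red', 'blue', 'green']
codes = [colour_to_code[c] for c in colours]
print(codes)                  # [1, 1, 3, 2]

# the codes support arithmetic ...
print(codes[0] + codes[1])    # 2
# ... but "red + red = green" makes no sense for the labels themselves
```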

## What is one-hot encoding?

A one-hot encoding is a representation of categorical variables as binary vectors. As Wikipedia puts it, it is a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0).

Colour | Categorical # | Binary | Gray code | One-hot |
---|---|---|---|---|
red | 1 | 000 | 000 | 00000001 |
green | 2 | 001 | 001 | 00000010 |
blue | 3 | 010 | 011 | 00000100 |

The table above shows how each colour can be mapped in a single row. To be more intuitive and precise, however, the values are normally laid out as one column per label before being fed into a model.

Red | Green | Blue | … |
---|---|---|---|
1 | 0 | 0 | |
0 | 1 | 0 | |
0 | 0 | 1 | |

The binary variables are often called “dummy variables” in other fields, such as statistics.
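Pandas users can get this column-per-label layout directly. Here is a minimal sketch, assuming pandas is installed (its `get_dummies` function is the standard dummy-variable helper):

```python
import pandas as pd

# one column per colour, with a single 1 (or True in recent pandas) per row
colours = pd.Series(['red', 'green', 'blue', 'red'])
print(pd.get_dummies(colours))
```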

Note that this encoding is a data preprocessing step and does not, by itself, immediately affect the performance of an algorithm.

## Encode data with Python

We will create fake label data and use different libraries to transform the data.

### Using OneHotEncoder from scikit-learn:

```python
from numpy import array
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# fake label data
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
values = array(data)

# integer encode: map each label to an integer (cold=0, hot=1, warm=2)
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)

# binary (one-hot) encode: OneHotEncoder expects a 2-D array
# note: on scikit-learn >= 1.2 use OneHotEncoder(sparse_output=False) instead
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)
```

Result:

```
[[ 1.  0.  0.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]
 [ 0.  1.  0.]]
```
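As a side note, newer versions of scikit-learn (0.20 and later) can one-hot encode string labels directly, without the separate integer-encoding step. A minimal sketch, assuming such a version is installed:

```python
from numpy import array
from sklearn.preprocessing import OneHotEncoder

data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
values = array(data).reshape(-1, 1)          # OneHotEncoder expects a 2-D array

# sparse_output=False requires scikit-learn >= 1.2; use sparse=False on older versions
encoder = OneHotEncoder(sparse_output=False)
onehot_encoded = encoder.fit_transform(values)
print(encoder.categories_)                   # columns are ordered ['cold', 'hot', 'warm']
print(onehot_encoded)
```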

### Using to_categorical from Keras:

```python
from numpy import array
from keras.utils import to_categorical   # or: from tensorflow.keras.utils import to_categorical

# define example data: labels that are already integer-encoded
data = [1, 3, 2, 0, 3, 2, 2, 1, 0, 1]
data = array(data)

# one-hot encode: each integer becomes a binary vector of length max(data) + 1
encoded = to_categorical(data)
print(encoded)
```

Result:

```
[[ 0.  1.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  1.  0.]
 [ 1.  0.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  1.  0.]
 [ 0.  1.  0.  0.]
 [ 1.  0.  0.  0.]
 [ 0.  1.  0.  0.]]
```
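To go back from the one-hot vectors to the original integer labels, the usual trick is `numpy.argmax` along each row. A minimal sketch, repeating the example above so it runs on its own:

```python
from numpy import array, argmax
from keras.utils import to_categorical

data = array([1, 3, 2, 0, 3, 2, 2, 1, 0, 1])
encoded = to_categorical(data)

# argmax returns the index of the single 1 in each row,
# which is exactly the original integer label
decoded = argmax(encoded, axis=1)
print(decoded)   # [1 3 2 0 3 2 2 1 0 1]
```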