Quickly grasp the concept of one-hot encoding with simple data and code examples.


## Categorical vs. numerical data

Categorical data are variables that contain label values. For example, a "*colour*" variable can have the values "**red**", "**green**", and "**blue**". Here, "red", "green", and "blue" are labels represented by strings.

Numerical data are variables that contain numeric values. For example, an "*age*" variable can have the values **10**, **25**, and **36**.

## Why does categorical data matter?

Some algorithms can work with categorical data directly, e.g. decision trees. Many other algorithms require numerical input.

The question is: "**can we simply map labels to numbers, like "red" -> 1, "green" -> 2, and so on?**"

The answer is **"yes and … no"**.

Why no? After the mapping we do have numerical data, but the numbers carry relations the labels never had: it does not make sense to say **"red + red = green"** just because 1 + 1 = 2. In other words, such a mapping may introduce spurious order and arithmetic relations between the original labels. One-hot encoding was introduced to remove such relations between labels.
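To make the problem concrete, here is a tiny sketch in plain Python (the mapping values are hypothetical) showing how an integer encoding invents arithmetic and ordering that the labels never had:

```
# naive integer mapping (hypothetical): red -> 1, green -> 2, blue -> 3
mapping = {'red': 1, 'green': 2, 'blue': 3}

# the numbers now satisfy "red + red == green" ...
assert mapping['red'] + mapping['red'] == mapping['green']

# ... and impose an ordering "red < green < blue" that the labels never had
assert mapping['red'] < mapping['green'] < mapping['blue']
```

A model consuming these codes has no way to know the arithmetic is meaningless, which is exactly why the one-hot representation below is preferred.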

## What is one-hot encoding?

A one-hot encoding is a representation of categorical variables as binary vectors. It is "a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0)" (Wikipedia).

Colour | Categorical # | Binary | Gray code | One-hot |
---|---|---|---|---|
red | 1 | 000 | 000 | 00000001 |
green | 2 | 001 | 001 | 00000010 |
blue | 3 | 010 | 011 | 00000100 |

The table shows an example of how each colour can be mapped to a one-hot value in a single row. To be more intuitive and precise, however, the values are normally arranged as columns, one per label, before being fed into a model:

Red | Green | Blue | … |
---|---|---|---|
1 | 0 | 0 | |
0 | 1 | 0 | |
0 | 0 | 1 | |

The binary variables are often called “dummy variables” in other fields, such as statistics.
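As a quick illustration of dummy variables, `pandas.get_dummies` produces exactly this column-per-label layout (a minimal sketch; the colour values are taken from the table above):

```
import pandas as pd

# one row per observation, one column per label after encoding
colours = pd.Series(['red', 'green', 'blue'])
dummies = pd.get_dummies(colours)
print(dummies)
```

Note that `get_dummies` sorts the columns alphabetically (blue, green, red) and, in recent pandas versions, returns boolean rather than integer columns.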

Note that this encoding is a data-preprocessing step; it does not by itself determine the performance of an algorithm.

## Encode data with Python

We will create fake label data and use different libraries to transform the data.

### Using OneHotEncoder from scikit-learn:

```
from numpy import array
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# fake label data
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
values = array(data)
# integer encode: map each label to an integer first
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
# binary encode (note: scikit-learn >= 1.2 renames this keyword to sparse_output)
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)
```

Result:

```
[[ 1. 0. 0.]
[ 1. 0. 0.]
[ 0. 0. 1.]
[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 1. 0.]
[ 0. 0. 1.]
[ 1. 0. 0.]
[ 0. 0. 1.]
[ 0. 1. 0.]]
```

### Using to_categorical from Keras:

```
from numpy import array
from keras.utils import to_categorical
# define example
data = [1, 3, 2, 0, 3, 2, 2, 1, 0, 1]
data = array(data)
# one hot encode
encoded = to_categorical(data)
print(encoded)
```

Result:

```
[[ 0. 1. 0. 0.]
[ 0. 0. 0. 1.]
[ 0. 0. 1. 0.]
[ 1. 0. 0. 0.]
[ 0. 0. 0. 1.]
[ 0. 0. 1. 0.]
[ 0. 0. 1. 0.]
[ 0. 1. 0. 0.]
[ 1. 0. 0. 0.]
[ 0. 1. 0. 0.]]
```

[…] The column "Origin" is really categorical, not numeric. To avoid implying numeric relations between its values, we convert it to a one-hot encoding: […]
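A minimal sketch of that conversion with pandas, assuming a hypothetical `Origin` column (the actual dataset is not shown here):

```
import pandas as pd

# hypothetical data: 'Origin' holds category labels, not quantities
df = pd.DataFrame({'Origin': ['USA', 'Europe', 'Japan', 'USA']})

# replace the single categorical column with one binary column per origin
df = pd.get_dummies(df, columns=['Origin'], prefix='', prefix_sep='')
print(df)
```

After the call, the single `Origin` column is gone and each row has a 1 in exactly one of the new origin columns.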