Quickly grasp the concept of one-hot encoding through simple data and code.
Categorical vs. numerical data
Categorical data are variables that contain label values. For example, a “colour” variable can have the values “red”, “green”, and “blue”. Here, “red”, “green”, and “blue” are labels represented by strings.
Numerical data are variables that contain numeric values. For example, an “age” variable can have the values 10, 25, and 36.
Why does categorical data matter?
Some algorithms, such as decision trees, can work with categorical data directly. Many other algorithms operate only on numerical data.
The question is: “Can we simply map labels to numbers, like “red” -> 1, “green” -> 2, and so on?”.
The answer is “Yes and … No”.
Why no? After such a mapping we do have numerical data, but it does not make sense to say “red + red = green”. In other words, the mapping introduces order and arithmetic relations that the original labels do not have. One-hot encoding was introduced to remove such artificial relations between labels.
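A minimal sketch of why the naive mapping is problematic (the mapping dictionary here is hypothetical, chosen only for illustration):

```python
# hypothetical integer mapping of the colour labels
mapping = {'red': 1, 'green': 2, 'blue': 3}

# the integers accidentally carry order and arithmetic the labels never had:
print(mapping['red'] < mapping['green'] < mapping['blue'])  # True, but "red < green" is meaningless
print(mapping['red'] + mapping['red'] == mapping['green'])  # True, i.e. "red + red = green"
```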
What is one-hot encoding?
A one-hot encoding is a representation of categorical variables as binary vectors. It is a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0).
Wikipedia
| Colour | Categorical # | Binary | Gray code | One-hot |
|---|---|---|---|---|
| red | 1 | 000 | 000 | 00000001 |
| green | 2 | 001 | 001 | 00000010 |
| blue | 3 | 010 | 011 | 00000100 |
The table shows, row by row, how each colour can be mapped. But to be more intuitive and precise, the one-hot values are normally presented as columns before feeding the data into a model.
| Red | Green | Blue | … |
|---|---|---|---|
| 1 | 0 | 0 | |
| 0 | 1 | 0 | |
| 0 | 0 | 1 | |
The binary variables are often called “dummy variables” in other fields, such as statistics.
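In Python, pandas offers `get_dummies` for exactly this purpose. A small sketch (assuming pandas is installed; the dummy column names are derived from the labels):

```python
import pandas as pd

colours = pd.Series(['red', 'green', 'blue', 'red'])
# one dummy column per label, ordered alphabetically
print(pd.get_dummies(colours).columns.tolist())                  # ['blue', 'green', 'red']
# statistics-style dummy coding often drops one level to avoid collinearity
print(pd.get_dummies(colours, drop_first=True).columns.tolist()) # ['green', 'red']
```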
Note that this encoding is part of data preprocessing; it does not by itself change the behaviour of an algorithm.
Encode data with Python
We will create fake label data and use different libraries to transform the data.
Using OneHotEncoder from scikit-learn:
from numpy import array
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# fake label data
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
values = array(data)
# integer encode: each label gets an integer in alphabetical order
# (cold -> 0, hot -> 1, warm -> 2)
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
# binary encode
# note: in scikit-learn >= 1.2 the argument is sparse_output=False
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)
Result:
[[ 1. 0. 0.]
[ 1. 0. 0.]
[ 0. 0. 1.]
[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 1. 0.]
[ 0. 0. 1.]
[ 1. 0. 0.]
[ 0. 0. 1.]
[ 0. 1. 0.]]
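The same result can be reproduced without scikit-learn, which makes the mechanics explicit. This is only a sketch: it mimics LabelEncoder by sorting the unique labels, then uses identity-matrix rows as one-hot vectors:

```python
import numpy as np

data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
# sorted unique labels, as LabelEncoder would order them: ['cold', 'hot', 'warm']
categories = sorted(set(data))
indices = [categories.index(v) for v in data]
# row i of the identity matrix is the one-hot vector for class i
onehot = np.eye(len(categories))[indices]
print(onehot)  # same matrix as the scikit-learn output above
```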
Using to_categorical from Keras:
from numpy import array
from keras.utils import to_categorical
# define example
data = [1, 3, 2, 0, 3, 2, 2, 1, 0, 1]
data = array(data)
# one hot encode
encoded = to_categorical(data)
print(encoded)
Result:
[[ 0. 1. 0. 0.]
[ 0. 0. 0. 1.]
[ 0. 0. 1. 0.]
[ 1. 0. 0. 0.]
[ 0. 0. 0. 1.]
[ 0. 0. 1. 0.]
[ 0. 0. 1. 0.]
[ 0. 1. 0. 0.]
[ 1. 0. 0. 0.]
[ 0. 1. 0. 0.]]
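Decoding goes the other way: the class index is simply the position of the single 1 in each row, which `np.argmax` recovers. A sketch using the first three rows of the result above:

```python
import numpy as np

encoded = np.array([[0., 1., 0., 0.],
                    [0., 0., 0., 1.],
                    [0., 0., 1., 0.]])
# argmax along each row returns the index of the 1, i.e. the original integer
decoded = np.argmax(encoded, axis=1)
print(decoded)  # [1 3 2]
```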
[…] The column "Origin" is really categorical (not numeric). To eliminate the linear relations between them, we convert that to a one-hot: […]
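With a DataFrame this conversion is a one-liner via pandas (a sketch; the “Origin” values below are hypothetical, chosen only to illustrate the idea):

```python
import pandas as pd

# hypothetical categorical "Origin" column
df = pd.DataFrame({'Origin': ['USA', 'Europe', 'Japan', 'USA']})
# replace the column with one dummy column per category
df = pd.get_dummies(df, columns=['Origin'], prefix='Origin')
print(df.columns.tolist())  # ['Origin_Europe', 'Origin_Japan', 'Origin_USA']
```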