Quickly grasp the concept of one-hot encoding with simple data and code examples.


## Categorical vs. numerical data

Categorical data are variables that contain label values. For example, a "*colour*" variable can have the values "**red**", "**green**", and "**blue**". Here, "red", "green", and "blue" are labels represented by strings.

Numerical data are variables that contain numeric values. For example, an "*age*" variable can have the values **10**, **25**, and **36**.

## Why does categorical data matter?

Some algorithms can work with categorical data directly, e.g. decision trees. Many other algorithms require numerical input.

The question is: "**can we simply map labels to numbers, like "red" -> 1, "green" -> 2, and so on?**"

The answer is **"yes and … no"**.

Why no? After the mapping we do have numerical data, but the numbers carry relations the labels never had: it does not make sense to say **"red + red = green"** just because 1 + 1 = 2. In other words, such a mapping may introduce spurious order and arithmetic relations between the original labels. One-hot encoding was introduced to remove such relations between labels.
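To make the problem concrete, here is a tiny sketch in plain Python (the mapping values are hypothetical) showing how an integer encoding invents arithmetic and ordering that the labels never had:

```
# naive integer mapping (hypothetical): red -> 1, green -> 2, blue -> 3
mapping = {'red': 1, 'green': 2, 'blue': 3}

# the numbers now satisfy "red + red == green" ...
assert mapping['red'] + mapping['red'] == mapping['green']

# ... and impose an ordering "red < green < blue" that the labels never had
assert mapping['red'] < mapping['green'] < mapping['blue']
```

A model consuming these codes has no way to know the arithmetic is meaningless, which is exactly why the one-hot representation below is preferred.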

## What is one-hot encoding?

A one-hot encoding is a representation of categorical variables as binary vectors. It is "a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0)" (Wikipedia).

Colour | Categorical # | Binary | Gray code | One-hot |
---|---|---|---|---|
red | 1 | 000 | 000 | 00000001 |
green | 2 | 001 | 001 | 00000010 |
blue | 3 | 010 | 011 | 00000100 |

The table shows an example of how each colour can be mapped to a one-hot value in a single row. To be more intuitive and precise, however, the values are normally arranged as columns, one per label, before being fed into a model:

Red | Green | Blue | … |
---|---|---|---|
1 | 0 | 0 | |
0 | 1 | 0 | |
0 | 0 | 1 | |

The binary variables are often called “dummy variables” in other fields, such as statistics.
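As a quick illustration of dummy variables, `pandas.get_dummies` produces exactly this column-per-label layout (a minimal sketch; the colour values are taken from the table above):

```
import pandas as pd

# one row per observation, one column per label after encoding
colours = pd.Series(['red', 'green', 'blue'])
dummies = pd.get_dummies(colours)
print(dummies)
```

Note that `get_dummies` sorts the columns alphabetically (blue, green, red) and, in recent pandas versions, returns boolean rather than integer columns.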

Note that this encoding is a data-preprocessing step; it does not by itself determine the performance of an algorithm.

## Encode data with Python

We will create fake label data and use different libraries to transform the data.

### Using OneHotEncoder from scikit-learn:

```
from numpy import array
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# fake label data
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
values = array(data)
# integer encode: map each label to an integer first
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
# binary encode (note: scikit-learn >= 1.2 renames this keyword to sparse_output)
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)
```

Result:

```
[[ 1. 0. 0.]
[ 1. 0. 0.]
[ 0. 0. 1.]
[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 1. 0.]
[ 0. 0. 1.]
[ 1. 0. 0.]
[ 0. 0. 1.]
[ 0. 1. 0.]]
```

### Using to_categorical from Keras:

```
from numpy import array
from keras.utils import to_categorical
# define example
data = [1, 3, 2, 0, 3, 2, 2, 1, 0, 1]
data = array(data)
# one hot encode
encoded = to_categorical(data)
print(encoded)
```

Result:

```
[[ 0. 1. 0. 0.]
[ 0. 0. 0. 1.]
[ 0. 0. 1. 0.]
[ 1. 0. 0. 0.]
[ 0. 0. 0. 1.]
[ 0. 0. 1. 0.]
[ 0. 0. 1. 0.]
[ 0. 1. 0. 0.]
[ 1. 0. 0. 0.]
[ 0. 1. 0. 0.]]
```

[…] The column "Origin" is really categorical, not numeric. To avoid implying numeric relations between its values, we convert it to a one-hot encoding: […]
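A minimal sketch of that conversion with pandas, assuming a hypothetical `Origin` column (the actual dataset is not shown here):

```
import pandas as pd

# hypothetical data: 'Origin' holds category labels, not quantities
df = pd.DataFrame({'Origin': ['USA', 'Europe', 'Japan', 'USA']})

# replace the single categorical column with one binary column per origin
df = pd.get_dummies(df, columns=['Origin'], prefix='', prefix_sep='')
print(df)
```

After the call, the single `Origin` column is gone and each row has a 1 in exactly one of the new origin columns.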