
One-hot encoding quick note


Quickly grasp the concept of one-hot encoding with simple data and code examples.

Categorical vs. numerical data

Categorical data are variables that contain label values. For example, a “colour” variable can have the values “red”, “green”, and “blue”. Here, “red”, “green”, and “blue” are labels represented by strings.

Numerical data are variables that contain numeric values. For example, an “age” variable can have the values 10, 25, and 36.

Why does categorical data matter?

Some algorithms, such as decision trees, can work with categorical data directly. Many other algorithms only operate properly on numerical data.

The question is: “can we simply map labels to numbers, like “red” -> 1, “green” -> 2, and so on?”.

The answer is “Yes and … No”.

Why no? After such a mapping, 1 + 1 = 2 holds for the numbers, but it does not make sense to say “red + red = green”. In other words, mapping labels to integers introduces order and arithmetic relations that the original labels do not have. One-hot encoding was introduced to remove such artificial relations between labels.
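For illustration, a minimal Python sketch of the problem (the colour-to-integer mapping below is purely hypothetical):

# hypothetical label-to-integer mapping
mapping = {'red': 1, 'green': 2, 'blue': 3}
# the integers carry arithmetic that the labels do not:
# 1 + 1 == 2 holds, yet "red + red = green" is meaningless
print(mapping['red'] + mapping['red'] == mapping['green'])  # True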

What is one-hot encoding?

A one-hot encoding is a representation of categorical variables as binary vectors. It is a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0).

Wikipedia
Colour   Categorical #   Binary   Gray code   One-hot
red      1               000      000         00000001
green    2               001      001         00000010
blue     3               010      011         00000100

The table shows an example of how each colour is mapped in a row. To be more intuitive and precise, the values are normally presented as columns before being fed into a model:

Red   Green   Blue
1     0       0
0     1       0
0     0       1

The binary variables are often called “dummy variables” in other fields, such as statistics.
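Incidentally, pandas names its helper after this convention; a minimal sketch using pandas.get_dummies (the colour data below is illustrative):

import pandas as pd
# each colour label becomes one dummy (binary) column
colours = pd.Series(['red', 'green', 'blue', 'red'])
dummies = pd.get_dummies(colours)
print(dummies)  # columns blue/green/red, exactly one hot entry per row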

Note that this encoding is part of data preprocessing; it does not, by itself, immediately affect the performance of an algorithm.

Encode data with Python

We will create fake label data and use different libraries to transform the data.

Using OneHotEncoder from scikit-learn:

from numpy import array
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# fake label data
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
values = array(data)
# integer encode the labels first: cold -> 0, hot -> 1, warm -> 2
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
# binary (one-hot) encode
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)

Result:

[[ 1.  0.  0.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]
 [ 0.  1.  0.]]
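To go back from a one-hot row to the original label, one can take the argmax of the row and invert the integer encoding; a minimal sketch, assuming the label_encoder and onehot_encoded variables from the snippet above:

from numpy import argmax
# index of the single high bit in the first row -> integer label
first_label = argmax(onehot_encoded[0, :])
# map the integer back to its string label
print(label_encoder.inverse_transform([first_label]))  # ['cold']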

Using to_categorical from Keras:

from numpy import array
from keras.utils import to_categorical
# define example integer labels
data = [1, 3, 2, 0, 3, 2, 2, 1, 0, 1]
data = array(data)
# one-hot encode
encoded = to_categorical(data)
print(encoded)

Result:

[[ 0.  1.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  1.  0.]
 [ 1.  0.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  1.  0.]
 [ 0.  1.  0.  0.]
 [ 1.  0.  0.  0.]
 [ 0.  1.  0.  0.]]
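The inverse transformation is simply an argmax along the class axis; a minimal sketch, assuming the encoded array from the snippet above:

from numpy import argmax
# recover the original integer labels from the one-hot rows
decoded = argmax(encoded, axis=1)
print(decoded)  # [1 3 2 0 3 2 2 1 0 1]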
