A.I, Data and Software Engineering

One-hot encoding quick note

Quickly grasp the concept of one-hot encoding through simple data and code.

Categorical vs. numerical data

Categorical data are variables that contain label values. For example, a “colour” variable can have the values “red”, “green” and “blue”. Here, “red”, “green” and “blue” are labels represented by strings.

Numerical data are variables that contain numeric values. For example, an “age” variable can have the values 10, 25 and 36.

Why does categorical data matter?

Some algorithms, e.g. decision trees, can work with categorical data directly. Many other algorithms only operate properly on numerical data.

The question is: can we simply map labels to numbers, such as “red” -> 1, “green” -> 2, and so on?

The answer is “Yes and … No”.

Why No? After such a mapping, the labels inherit the relations between the numbers. We have $1 + 1 = 2$ for numerical data, but it does not make sense to say “red + red = green”. In other words, the mapping may introduce relations that did not exist between the original labels. One-hot encoding was introduced to avoid such artificial relations between labels.
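One way to see the problem is to compare how far apart the labels are under each representation. The sketch below is only an illustration: the integer mapping comes from the question above, while the bit order of the one-hot vectors and the use of Euclidean distance are assumptions made for the example.

```python
import math
from itertools import combinations

# Integer mapping from the text: "red" -> 1, "green" -> 2, "blue" -> 3
label_ids = {"red": 1, "green": 2, "blue": 3}

# One-hot vectors for the same labels (the bit order is an arbitrary choice)
one_hot = {"red": (1, 0, 0), "green": (0, 1, 0), "blue": (0, 0, 1)}

# The integer mapping makes "red + red = green" numerically true,
# although the statement has no meaning for colours.
print(label_ids["red"] + label_ids["red"] == label_ids["green"])

int_gaps = {}
vec_gaps = {}
for a, b in combinations(label_ids, 2):
    int_gaps[(a, b)] = abs(label_ids[a] - label_ids[b])
    vec_gaps[(a, b)] = math.dist(one_hot[a], one_hot[b])
    print(f"{a}-{b}: integer gap = {int_gaps[(a, b)]}, "
          f"one-hot distance = {vec_gaps[(a, b)]:.3f}")
```

With the integer mapping, “blue” ends up twice as far from “red” as “green” is. With one-hot vectors, every pair of labels is the same distance apart, so no ordering is implied.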

What is one-hot encoding?

A one-hot encoding is a representation of categorical variables as binary vectors. It is a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0).

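(Source: Wikipedia)

| Colour | Categorical # | Binary | Gray code | One-hot  |
|--------|---------------|--------|-----------|----------|
| red    | 1             | 000    | 000       | 00000001 |
| green  | 2             | 001    | 001       | 00000010 |
| blue   | 3             | 010    | 011       | 00000100 |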
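The rows of this table can be reproduced with a few lines of Python. This is a minimal sketch, assuming the Binary, Gray code and One-hot columns are derived from a zero-based index (the category number minus one), which is what the values in the table suggest. The helper name `encodings` and the widths of 3 and 8 bits are choices made for this example.

```python
def encodings(index, width=3, one_hot_width=8):
    """Return (binary, Gray code, one-hot) bit strings for a zero-based index."""
    binary = format(index, f"0{width}b")
    # Standard binary-reflected Gray code: index XOR (index shifted right by one)
    gray = format(index ^ (index >> 1), f"0{width}b")
    # One-hot: a single 1 bit at position `index`, counted from the right
    one_hot = format(1 << index, f"0{one_hot_width}b")
    return binary, gray, one_hot


colours = ["red", "green", "blue"]
rows = {colour: encodings(i) for i, colour in enumerate(colours)}

for colour, (binary, gray, one_hot) in rows.items():
    print(f"{colour:<6} {binary} {gray} {one_hot}")
```

Only the one-hot column guarantees exactly one high bit per label. The binary and Gray codes are more compact, but their bits are shared between labels.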

The table shows how each colour is mapped to a row of codes. To be more intuitive and precise, the one-hot values are normally presented in columns, one column per category, before being fed into a model.

| Red | Green | Blue |
|-----|-------|------|
| 1   | 0     | 0    |
| 0   | 1     | 0    |
| 0   | 0     | 1    |
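The column layout can be built by hand before reaching for a library. The following is a minimal sketch in plain Python. The function name `one_hot_columns` is hypothetical, and ordering the columns by first appearance is an assumption; libraries may order them differently.

```python
def one_hot_columns(labels):
    """Return (categories, rows): one column per category, one row per label."""
    # Unique categories in first-seen order (dicts preserve insertion order)
    categories = list(dict.fromkeys(labels))
    rows = [[1 if label == category else 0 for category in categories]
            for label in labels]
    return categories, rows


categories, rows = one_hot_columns(["red", "green", "blue"])

print(categories)
for row in rows:
    print(row)
```

Each row contains exactly one 1, placed in the column of its own category.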

The binary variables are often called “dummy variables” in other fields, such as statistics.

Note that this encoding is part of data preprocessing; on its own, it does not directly affect the performance of an algorithm.

Encode data with Python

We will create fake label data and use different libraries to transform the data.

Using OneHotEncoder from scikit-learn:
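The original listing is not reproduced here, so the following is a minimal sketch of one way to do it rather than the author's exact code. The label values are made-up sample data. The default sparse output is converted with `.toarray()` so that the snippet does not depend on the name of the dense-output parameter, which has changed between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fake label data (hypothetical). The encoder expects a 2-D array of
# shape (n_samples, n_features), hence the reshape.
colours = np.array(["red", "green", "blue", "green"]).reshape(-1, 1)

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(colours).toarray()

# Columns follow the sorted categories: blue, green, red
print(encoder.categories_[0])
print(one_hot)

# The encoding can be reversed to recover the original labels
decoded = encoder.inverse_transform(one_hot)
print(decoded.ravel())
```

Note that the columns follow the alphabetically sorted categories, so “red” is encoded as the last column here rather than the first.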

Using to_categorical from Keras:
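As above, this is a sketch under stated assumptions rather than the original listing. `to_categorical` works on integer class indices, so the string labels are first mapped to integers. The mapping `label_to_index` and the sample labels are hypothetical. The `except` branch is not part of Keras: it is a small NumPy stand-in, added only so that the example still runs where Keras is not installed.

```python
import numpy as np

try:
    from keras.utils import to_categorical
except ImportError:
    # Assumption: Keras is not available. This stand-in mimics the basic
    # behaviour of to_categorical with NumPy so the sketch still runs.
    def to_categorical(x, num_classes=None):
        x = np.asarray(x, dtype="int64")
        if num_classes is None:
            num_classes = int(x.max()) + 1
        return np.eye(num_classes)[x]

# Fake label data (hypothetical) and an explicit label -> index mapping
labels = ["red", "green", "blue", "green"]
label_to_index = {"red": 0, "green": 1, "blue": 2}
indices = np.array([label_to_index[label] for label in labels])

one_hot = to_categorical(indices, num_classes=3)

# One row per label; column i is 1 when the label's index is i
print(one_hot)
```

Unlike the scikit-learn encoder, the column order here is decided by the integer mapping you supply, so “red” occupies the first column.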
