A.I, Data and Software Engineering

One-hot encoding matrices demonstration


This post demonstrates one-hot encoding for a rating matrix, using the MovieLens dataset as an example.

One-hot encoding

Previously, we introduced one-hot encoding in a quick note. It represents a categorical variable as a binary vector: a group of bits in which the only legal combinations are those with a single high (1) bit and all the others low (0).
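
As a minimal sketch of that definition, the helper below builds such a vector in plain Python. The function name one_hot and the three-category example are illustrative assumptions, not part of any library.

```python
def one_hot(index, num_classes):
    """Return a list of num_classes bits with a single 1 at position index."""
    if not 0 <= index < num_classes:
        raise ValueError("index must be in [0, num_classes)")
    return [1 if i == index else 0 for i in range(num_classes)]


# Three hypothetical categories, numbered 0..2: one vector per category.
vectors = [one_hot(i, 3) for i in range(3)]
```

Each vector has exactly one high bit, so the position of that bit identifies the category.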

Rating matrix

If you work with recommender systems, you will likely deal with rating matrices like the one below.

         item 1   item 2
user 1   5        2
user 2   1        4

The table contains two categorical variables: user id and item id. We can encode them in two simple steps:

  1. Convert categorical data to numbers
  2. Encode numbers to one-hot format

Sometimes the ids are already numerical, in which case we can skip step 1. Next, we will create a new table that encodes both pieces of information.
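
The two steps can be sketched on the small rating matrix above without any library. The helper names build_index and one_hot are hypothetical and exist only for this illustration.

```python
# Toy ratings from the small matrix above, as (user, item, rating) triples.
ratings = [
    ("user 1", "item 1", 5),
    ("user 1", "item 2", 2),
    ("user 2", "item 1", 1),
    ("user 2", "item 2", 4),
]


def build_index(labels):
    """Step 1: map each distinct label to an integer, in order of first appearance."""
    index = {}
    for label in labels:
        index.setdefault(label, len(index))
    return index


def one_hot(position, size):
    """Step 2: a list of `size` bits with a single 1 at `position`."""
    return [1 if i == position else 0 for i in range(size)]


user_index = build_index(u for u, _, _ in ratings)
item_index = build_index(i for _, i, _ in ratings)

# Each encoded row holds the user bits, then the item bits, then the rating.
encoded = [
    one_hot(user_index[u], len(user_index))
    + one_hot(item_index[i], len(item_index))
    + [r]
    for u, i, r in ratings
]
```

Every rating becomes one row, so a matrix with two users and two items turns into four rows of five values each.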

Loading movielens data

We will use the MovieLens 100K (ml-100k) dataset and load the rating data from the file u.data.

import pandas as pd
rdata = pd.read_csv(dt_dir_name + '/' + 'u.data', delimiter='\t', names=['userId', 'movieId', 'rating', 'timestamp'])
   userId  movieId  rating  timestamp
0     196      242       3  881250949
1     186      302       3  891717742
2      22      377       1  878887116
3     244       51       2  880606923
4     166      346       1  886397596

Now, we extract the userId and movieId.

uids = rdata['userId']#.drop_duplicates().sort_values()
iids = rdata['movieId']
uids.shape, iids.shape
#((100000,), (100000,))

The data has 100,000 records, each with a user id and a movie id. Next, we encode both columns with to_categorical from keras.utils.

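# The next line is a Google Colab magic; omit it outside Colab
%tensorflow_version 2.x
from keras.utils import to_categorical
# one-hot encode the user ids and the movie ids
encoded_uids = to_categorical(uids)
encoded_iids = to_categorical(iids)
# Output:
# TensorFlow 2.x selected.
# Using TensorFlow backend.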
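
To see what this call produces without installing Keras, here is a rough NumPy sketch that mirrors the default behaviour of to_categorical as we understand it: one column per integer from 0 up to the largest id, with float32 values. The function to_one_hot and the sample ids are assumptions made for illustration.

```python
import numpy as np


def to_one_hot(ids, num_classes=None):
    """One-hot encode non-negative integer ids, one row per id."""
    ids = np.asarray(ids, dtype=int)
    if num_classes is None:
        # Ids are treated as 0-based positions, so the width is max id + 1.
        num_classes = int(ids.max()) + 1
    # Row k of the identity matrix is the one-hot vector for id k.
    return np.eye(num_classes, dtype="float32")[ids]


sample_ids = [1, 3, 2, 3]          # hypothetical 1-based ids
encoded = to_one_hot(sample_ids)   # 4 rows, columns 0..3
```

Because the sample ids start at 1, column 0 is never set. The same thing happens with the MovieLens ids, which also start at 1.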

Finally, we create the encoded data frame as follows.

#append
encodedRdata = pd.concat([pd.DataFrame(encoded_uids), pd.DataFrame(encoded_iids), rdata['rating']], axis=1)
       0  1  2  3  ...  1680  1681  1682  rating
0      0  0  0  0  ...     0     0     0       3
...
99998  0  0  0  0  ...     0     0     0       2
99999  0  0  0  0  ...     0     0     0       3
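
Note that the two encoded frames both use integer column labels starting at 0, so the combined frame contains duplicate labels. As an alternative sketch, not the method used in this post, pandas' get_dummies can build the same kind of table with prefixed column names. The toy frame below is hypothetical.

```python
import pandas as pd

# Hypothetical toy ratings; the column names follow the MovieLens frame.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [1, 2, 1, 2],
    "rating": [5, 2, 1, 4],
})

# get_dummies creates one indicator column per distinct id that occurs.
user_cols = pd.get_dummies(toy["userId"], prefix="user", dtype=int)
item_cols = pd.get_dummies(toy["movieId"], prefix="item", dtype=int)

# Same layout as before: user bits, item bits, then the rating.
encoded_toy = pd.concat([user_cols, item_cols, toy["rating"]], axis=1)
```

Unlike to_categorical, get_dummies only creates columns for ids that actually appear, so there is no spare column for id 0, and the prefixes keep user and item columns apart.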

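The process significantly increases the width of the table: the original frame has 4 columns, while the encoded one is 100k x 2628 (944 user columns, 1683 movie columns and the rating column). These counts are one larger than the 943 users and 1682 movies in the dataset because to_categorical allocates max(id) + 1 columns and the ids start at 1. The encoded values are floats by default, so if you want to convert the encoded data to integers: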

encodedRdata = encodedRdata.astype(int)

To sum up

This post demonstrated an easy way to pre-process rating data such as MovieLens. We used to_categorical from Keras to one-hot encode the user and movie ids. The same technique can be applied to similar problems.
