One-hot encoding matrices demonstration

This post demonstrates one-hot encoding for a rating matrix, such as the one in the MovieLens dataset.

One-hot encoding

Previously, we introduced a quick note on one-hot encoding. It represents a categorical variable as a binary vector: a group of bits in which the only legal combinations of values are those with a single high (1) bit and all the others low (0).
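As a minimal sketch of this definition, the snippet below builds one-hot vectors in plain Python. The helper name `one_hot` is hypothetical and chosen only for illustration.

```python
def one_hot(index, num_classes):
    """Return a list of num_classes bits with a single 1 at position index."""
    if not 0 <= index < num_classes:
        raise ValueError("index must be in [0, num_classes)")
    return [1 if i == index else 0 for i in range(num_classes)]

# Three categories, each mapped to its own vector
vectors = [one_hot(i, 3) for i in range(3)]
print(vectors)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Every vector has exactly one bit set, so no category is treated as numerically "larger" than another.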

Rating matrix

If you work with recommender systems, you will likely deal with rating matrices like the one below.

	item 1	item 2
user 1	5	2
user 2	1	4

The table contains two categorical variables, i.e. the user id and the item id. We can encode them in two simple steps:

1. Convert categorical data to numbers
2. Encode the numbers in one-hot format

Sometimes, the ids are already numerical. In that case, we can skip step 1. Now, we will create a new table that encodes both pieces of information.
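The two steps can be sketched on the toy rating table above using plain Python. The helper names `build_index` and `one_hot` are hypothetical, and the table is flattened into (user, item, rating) triples as an assumption about how the data would be stored.

```python
# Toy rating table from above as (user, item, rating) triples
ratings = [
    ("user 1", "item 1", 5),
    ("user 1", "item 2", 2),
    ("user 2", "item 1", 1),
    ("user 2", "item 2", 4),
]

def build_index(labels):
    """Step 1: map each distinct label to an integer, in order of first appearance."""
    index = {}
    for label in labels:
        index.setdefault(label, len(index))
    return index

def one_hot(position, size):
    """Step 2: encode an integer as a one-hot list of the given size."""
    return [1 if i == position else 0 for i in range(size)]

user_index = build_index(u for u, _, _ in ratings)
item_index = build_index(i for _, i, _ in ratings)

# One row per rating: user bits, then item bits, then the rating itself
encoded = [
    one_hot(user_index[u], len(user_index))
    + one_hot(item_index[i], len(item_index))
    + [r]
    for u, i, r in ratings
]
for row in encoded:
    print(row)
```

Each output row has two user bits, two item bits, and the rating, for example `[1, 0, 1, 0, 5]` for user 1 rating item 1.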

Loading MovieLens data

We will use the MovieLens 100K (ml-100k) dataset and load the rating data from the file u.data.

import pandas as pd  # dt_dir_name is the folder that holds the extracted ml-100k files
rdata = pd.read_csv(dt_dir_name + '/u.data', delimiter='\t', names=['userId', 'movieId', 'rating', 'timestamp'])

	userId	movieId	rating	timestamp
0	196	242	3	881250949
1	186	302	3	891717742
2	22	377	1	878887116
3	244	51	2	880606923
4	166	346	1	886397596

Now, we extract the userId and movieId.

uids = rdata['userId']
iids = rdata['movieId']
uids.shape, iids.shape
# ((100000,), (100000,))

The data has one hundred thousand records, each with a user id and a movie id. Next, we encode userId and movieId with to_categorical from keras.utils.

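# Colab-only magic that selects TensorFlow 2.x; omit it outside Google Colab
%tensorflow_version 2.x
from keras.utils import to_categorical
# one-hot encode the user ids and the movie ids
encoded_uids = to_categorical(uids)
encoded_iids = to_categorical(iids)
# TensorFlow 2.x selected.
# Using TensorFlow backend.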
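One detail worth checking: when `num_classes` is not given, `to_categorical` sizes its output as the largest id plus one. MovieLens ids start at 1, so column 0 of each encoded block is never set. The sketch below mimics that sizing rule in plain Python (the helper `one_hot_rows` is hypothetical and not part of Keras) and shows that shifting the ids to start at 0 removes the unused column.

```python
def one_hot_rows(ids, num_classes=None):
    """Mimic the default sizing rule: num_classes = max(ids) + 1."""
    if num_classes is None:
        num_classes = max(ids) + 1
    return [[1.0 if col == i else 0.0 for col in range(num_classes)] for i in ids]

ids = [1, 3, 2]  # 1-based ids, as in u.data

raw = one_hot_rows(ids)
shifted = one_hot_rows([i - 1 for i in ids])

print(len(raw[0]))      # 4 columns; column 0 is always zero
print(len(shifted[0]))  # 3 columns, one per id
```

Whether to shift the ids is a design choice: keeping them unshifted means the column number equals the original id, at the cost of one empty column per block.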

Finally, we create the encoded data frame as follows.

# concatenate the encoded user ids, the encoded movie ids, and the ratings column-wise
encodedRdata = pd.concat([pd.DataFrame(encoded_uids), pd.DataFrame(encoded_iids), rdata['rating']], axis=1)

	0	1	2	3	…	1654	1655	1656	1657	1658	1659	1660	1661	1662	1663	1664	1665	1666	1667	1668	1669	1670	1671	1672	1673	1674	1675	1676	1677	1678	1679	1680	1681	1682	rating
0	0	0	0	0	…	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	3
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
99998	0	0	0	0	…	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	2
99999	0	0	0	0	…	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	3
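Note that the header above repeats the labels 0, 1, 2 and so on, because both encoded blocks keep their default integer column names. One possible way to keep the columns distinguishable is to prefix them before concatenating. The sketch below uses small stand-in arrays built with `np.eye`; the prefixes `user_` and `movie_` and the `toy_` variable names are illustrative choices, not part of the original code.

```python
import numpy as np
import pandas as pd

# Small stand-ins for encoded_uids, encoded_iids and rdata['rating']
toy_uids = np.eye(3)[[1, 2, 1]]   # 3 rows, 3 user columns
toy_iids = np.eye(4)[[3, 1, 2]]   # 3 rows, 4 movie columns
toy_rating = pd.Series([3, 1, 2], name='rating')

# add_prefix turns the integer labels 0, 1, ... into user_0, user_1, ...
toy_encoded = pd.concat(
    [
        pd.DataFrame(toy_uids).add_prefix('user_'),
        pd.DataFrame(toy_iids).add_prefix('movie_'),
        toy_rating,
    ],
    axis=1,
)
print(toy_encoded.columns.tolist())
```

With unique labels, a column can be selected by name without ambiguity between the user block and the movie block.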

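The process significantly increases the size of the table, from (100k x 4) to (100k x 2628): 944 user columns, 1683 movie columns, and the rating column (the ids start at 1, so each encoded block includes an unused column for index 0). The encoded values are stored as floats by default. If you want to convert the encoded data to integers: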

encodedRdata = encodedRdata.astype(int)

To sum up

This post demonstrates an easy way to pre-process rating data, such as MovieLens. We use to_categorical from Keras to one-hot encode the user and movie ids. The technique can be applied to similar problems.
