This post demonstrates one-hot encoding for a rating matrix, using the MovieLens dataset as an example.
One-hot encoding
Previously, we introduced a quick note on one-hot encoding. It represents a categorical variable as a binary vector: a group of bits in which the only legal combinations are those with a single high (1) bit and all other bits low (0).
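As a minimal sketch of that definition, the snippet below builds one-hot rows with NumPy. The helper name one_hot is hypothetical and only used for illustration; it assumes the categories are already non-negative integers.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """Return a (len(labels), num_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        # Assumption: labels start at 0, so we need (largest label + 1) columns
        num_classes = labels.max() + 1
    encoded = np.zeros((labels.size, num_classes), dtype=int)
    # Set the bit at position `label` in each row
    encoded[np.arange(labels.size), labels] = 1
    return encoded

example = one_hot([0, 2, 1])
print(example)
```

Every row sums to 1, which is the "single high bit" property described above.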
Rating matrix
If you work with recommender systems, you will likely deal with rating matrices like the one below.
| | item 1 | item 2 |
|---|---|---|
| user 1 | 5 | 2 |
| user 2 | 1 | 4 |
There are two categorical variables in the table, i.e. user id and item id. We can encode them in two simple steps:
- Convert categorical data to numbers
- Encode numbers to one-hot format
Sometimes the ids are already numerical, in which case we can skip step 1. Now, we will create a new table that encodes both pieces of information.
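The two steps can be sketched on a toy example before moving to real data. The user and item names below are made up, and NumPy is used here instead of Keras purely for illustration: np.unique assigns integer codes (in sorted order), and indexing an identity matrix turns the codes into one-hot rows.

```python
import numpy as np

# Hypothetical toy ratings with string ids (not from MovieLens)
users = np.array(["ann", "bob", "ann", "cat"])
items = np.array(["m2", "m1", "m1", "m2"])
ratings = np.array([5, 1, 2, 4])

# Step 1: convert categorical ids to integer codes (sorted order of the names)
user_names, user_codes = np.unique(users, return_inverse=True)
item_names, item_codes = np.unique(items, return_inverse=True)

# Step 2: one-hot encode the codes by picking rows of an identity matrix
user_onehot = np.eye(len(user_names), dtype=int)[user_codes]
item_onehot = np.eye(len(item_names), dtype=int)[item_codes]

# One row per rating: [user one-hot | item one-hot | rating]
encoded = np.hstack([user_onehot, item_onehot, ratings.reshape(-1, 1)])
print(encoded)
```

Each row of the result describes one rating: which user, which item, and the rating value in the last column.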
Loading MovieLens data
We will use the ml100k dataset and load the rating data from the file u.data.
import pandas as pd  # dt_dir_name is the folder that holds the extracted ml100k files
rdata = pd.read_csv(dt_dir_name + '/u.data', delimiter='\t', names=['userId', 'movieId', 'rating', 'timestamp'])
| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |
Now, we extract the userId and movieId.
uids = rdata['userId']#.drop_duplicates().sort_values()
iids = rdata['movieId']
uids.shape, iids.shape
#((100000,), (100000,))
The data has a hundred thousand records, each with a user id and a movie id. Next, we encode userId and movieId with keras.utils's to_categorical.
%tensorflow_version 2.x
from keras.utils import to_categorical
# one hot encode
encoded_uids = to_categorical(uids)
encoded_iids = to_categorical(iids)
#TensorFlow 2.x selected.
#Using TensorFlow backend.
Finally, we create the encoded data frame as follows.
#append
encodedRdata = pd.concat([pd.DataFrame(encoded_uids), pd.DataFrame(encoded_iids), rdata['rating']], axis=1)
| | 0 | 1 | 2 | 3 | … | 1654 | 1655 | 1656 | 1657 | 1658 | 1659 | 1660 | 1661 | 1662 | 1663 | 1664 | 1665 | 1666 | 1667 | 1668 | 1669 | 1670 | 1671 | 1672 | 1673 | 1674 | 1675 | 1676 | 1677 | 1678 | 1679 | 1680 | 1681 | 1682 | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 99998 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 99999 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
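One thing to watch in a frame built this way: both one-hot blocks get default integer column labels starting at 0, so the labels 0, 1, 2, ... appear once for users and again for movies. A possible workaround, sketched here on tiny stand-in arrays rather than the real encoded_uids and encoded_iids, is to prefix each block before concatenating. The prefixes user_ and movie_ are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

# Tiny stand-in one-hot blocks (not the real MovieLens encodings)
toy_uids = np.eye(3, dtype=int)[[1, 2]]   # two ratings, 3 user columns
toy_iids = np.eye(4, dtype=int)[[3, 1]]   # two ratings, 4 movie columns
toy_ratings = pd.Series([3, 2], name="rating")

# Prefix each block so user and movie columns have distinct labels
labelled = pd.concat(
    [
        pd.DataFrame(toy_uids).add_prefix("user_"),
        pd.DataFrame(toy_iids).add_prefix("movie_"),
        toy_ratings,
    ],
    axis=1,
)
print(labelled.columns.tolist())
```

With unique labels, a column such as labelled["user_1"] selects a single column instead of every column that happens to be labelled 1.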
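The process significantly increases the width of the table, from (100k x 4) to (100k x 2628): 944 user columns, 1683 movie columns, and the rating column. to_categorical returns floating-point values by default, so if you want to convert the encoded data to integer: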
encodedRdata = encodedRdata.astype(int)
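The figure of 2628 columns can be checked with a little arithmetic. to_categorical sizes its output as the largest id plus one, and the ml100k ids start at 1 (943 users, 1682 movies), so each block carries an unused column 0. The sketch below uses stand-in id ranges rather than the loaded data, and the helper onehot_width is a hypothetical name.

```python
import numpy as np

def onehot_width(ids):
    """Columns produced by a (max id + 1) one-hot encoding of integer ids."""
    return int(np.max(ids)) + 1

# Stand-in id ranges matching ml100k: 943 users and 1682 movies, ids from 1
user_ids = np.arange(1, 944)
movie_ids = np.arange(1, 1683)

# 944 user columns + 1683 movie columns + 1 rating column
total = onehot_width(user_ids) + onehot_width(movie_ids) + 1
print(total)

# Shifting ids to start at 0 before encoding drops the two unused columns
total_zero_based = onehot_width(user_ids - 1) + onehot_width(movie_ids - 1) + 1
print(total_zero_based)
```

Subtracting 1 from the ids before encoding is optional; it only removes the two all-zero columns.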
To sum up
This post demonstrates an easy way to pre-process rating data such as MovieLens. We use Keras's to_categorical to one-hot encode the data. The technique can be applied to similar problems.