We will build a recommender system that recommends the top n items for a user using the matrix factorization technique, one of the most popular approaches used in recommender systems.
Matrix factorization
Suppose we have a rating matrix R of m users and n items, where r_ui denotes the rating of user u for item i. Similar to PCA, the matrix factorization (MF) technique attempts to decompose the (very) large m × n matrix R into two much smaller matrices, e.g. a user matrix U of size m × k and an item matrix V of size n × k, such that R ≈ U·Vᵀ. While PCA requires a matrix with no missing values, MF can cope with a rating matrix in which most entries are missing.
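Concretely, if we write the rows of U as user vectors U_u and the rows of V as item vectors V_i, MF looks for embeddings that reproduce the observed ratings as closely as possible. A rough statement of the objective (with no regularization term, matching the plain mean-squared-error loss used later in this post) is:

\min_{U,V} \sum_{(u,i)\,\text{observed}} \left( r_{ui} - U_u \cdot V_i \right)^2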
Latent factors in MF
The two decomposed matrices have much smaller dimensions than the original one. Before applying MF, you need to choose the dimension k of the decomposed matrices; k is known as the number of latent factors. The intuition is that there are some unknown factors (k of them) that influence how users rate items. The good thing is that we don't have to say what exactly these factors are. MF will use the value of k to generate two matrices, also known as the user and item embedding matrices.
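To make the idea concrete, here is a minimal NumPy sketch (with made-up numbers, not taken from MovieLens) of how a small user embedding matrix and item embedding matrix reconstruct a rating matrix:

import numpy as np

# Toy example: 3 users, 4 items, k = 2 latent factors (values are made up)
U = np.array([[0.9, 0.2],   # user embedding matrix, shape (3, 2)
              [0.1, 0.8],
              [0.5, 0.5]])
V = np.array([[1.0, 0.0],   # item embedding matrix, shape (4, 2)
              [0.0, 1.0],
              [0.7, 0.3],
              [0.2, 0.9]])

# Reconstructed rating matrix, shape (3, 4): entry [u, i] is the dot product
# of user u's vector with item i's vector, i.e. the predicted rating.
R_hat = U @ V.T
print(R_hat)

This dot product of a user vector and an item vector is exactly what the Keras model below computes with its Dot layer.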
MF with Keras
We implement MF with Keras and TensorFlow 2.0 on the MovieLens dataset. You can refer to this article for downloading and processing MovieLens; here I will reuse some of that script to download the dataset.
from sklearn.datasets import dump_svmlight_file
import numpy as np
import pandas as pd
import os
import urllib.request
import zipfile
from sklearn.model_selection import train_test_split
import shutil
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
The dataset URLs are as follows:
datasets = {'ml100k':'http://files.grouplens.org/datasets/movielens/ml-100k.zip',
'ml20m':'http://files.grouplens.org/datasets/movielens/ml-20m.zip',
'mllatestsmall':'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip',
'ml10m':'http://files.grouplens.org/datasets/movielens/ml-10m.zip',
'ml1m':'http://files.grouplens.org/datasets/movielens/ml-1m.zip'
}
dt = 'ml100k'  # which MovieLens dataset to download
dt_name = os.path.basename(datasets[dt])
os.makedirs('./sample_data', exist_ok=True)
print('Downloading {}'.format(dt_name))
with urllib.request.urlopen(datasets[dt]) as response, open('./sample_data/'+dt_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
print('Download completed')
#Downloading ml-100k.zip
#Download completed
Next, we extract and load data to a data frame:
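The archive needs to be unzipped first. A minimal sketch of that step (assuming the ./sample_data path used in the download step above; the ml-100k archive extracts into an ml-100k folder, which we point dt_dir_name at):

# Extract the downloaded archive and remember where the files live
with zipfile.ZipFile('./sample_data/' + dt_name, 'r') as z:
    z.extractall('./sample_data/')
dt_dir_name = './sample_data/ml-100k'

With dt_dir_name defined, we can load u.data into a data frame: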
dataset = pd.read_csv(dt_dir_name+"/u.data",sep='\t',names="user_id,item_id,rating,timestamp".split(","))
| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |
The dataset contains 943 users and 1682 items. We reindex the users and items so that they start from 0 (the first index) instead of 1; each original id is therefore reduced by one.
dataset.user_id = dataset.user_id.astype('category').cat.codes.values
dataset.item_id = dataset.item_id.astype('category').cat.codes.values
| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 195 | 241 | 3 | 881250949 |
| 1 | 185 | 301 | 3 | 891717742 |
| 2 | 21 | 376 | 1 | 878887116 |
| 3 | 243 | 50 | 2 | 880606923 |
| 4 | 165 | 345 | 1 | 886397596 |
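Note that the re-coding above overwrites the original ids. If you later need to map the 0-based codes back to the original MovieLens ids (for example, to look up movie titles), one small sketch is to rebuild lookup tables from the raw file; the variable names here are just for illustration:

# Re-read the raw file and record which original id each 0-based code stands for
raw = pd.read_csv(dt_dir_name + "/u.data", sep='\t',
                  names="user_id,item_id,rating,timestamp".split(","))
user_code_to_raw_id = dict(enumerate(raw.user_id.astype('category').cat.categories))
item_code_to_raw_id = dict(enumerate(raw.item_id.astype('category').cat.categories))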
Next, we create train and test sets with 80% and 20% of the original dataset respectively.
train, test = train_test_split(dataset, test_size=0.2)
Let's say we select 20 as the number of latent factors. You may try other numbers, e.g. 3, 5 or 10.
%tensorflow_version 2.x
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.optimizers import Adam
#TensorFlow 2.x selected.
n_users, n_movies = len(dataset.user_id.unique()), len(dataset.item_id.unique())
n_latent_factors = 20

# Item branch: map each movie id to a k-dimensional embedding vector
movie_input = keras.layers.Input(shape=[1], name='Item')
movie_embedding = keras.layers.Embedding(n_movies + 1, n_latent_factors, name='Movie-Embedding')(movie_input)
movie_vec = keras.layers.Flatten(name='FlattenMovies')(movie_embedding)

# User branch: map each user id to a k-dimensional embedding vector
user_input = keras.layers.Input(shape=[1], name='User')
user_embedding = keras.layers.Embedding(n_users + 1, n_latent_factors, name='User-Embedding')(user_input)
user_vec = keras.layers.Flatten(name='FlattenUsers')(user_embedding)

# The predicted rating is the dot product of the two embedding vectors
prod = keras.layers.dot([movie_vec, user_vec], axes=1, name='DotProduct')
model = keras.Model([user_input, movie_input], prod)
We compile the model and also monitor two error metrics, namely mean absolute error (MAE) and mean squared error (MSE).
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae', 'mse'])
The model is summarized below.
model.summary()
Model: "model"
_____________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
=====================================================================================
Item (InputLayer) [(None, 1)] 0
_____________________________________________________________________________________
User (InputLayer) [(None, 1)] 0
_____________________________________________________________________________________
Movie-Embedding (Embedding) (None, 1, 20) 33660 Item[0][0]
_____________________________________________________________________________________
User-Embedding (Embedding) (None, 1, 20) 18880 User[0][0]
_____________________________________________________________________________________
FlattenMovies (Flatten) (None, 20) 0 Movie-Embedding[0][0]
_____________________________________________________________________________________
FlattenUsers (Flatten) (None, 20) 0 User-Embedding[0][0]
_____________________________________________________________________________________
DotProduct (Dot) (None, 1) 0 FlattenMovies[0][0]
FlattenUsers[0][0]
=====================================================================================
Total params: 52,540
Trainable params: 52,540
Non-trainable params: 0
Visualise the model using Keras utils’ plot_model:
tf.keras.utils.plot_model(model, to_file='model.png')

Great tool! Now it is time to train our model and log the history:
history = model.fit([train.user_id, train.item_id], train.rating, epochs=100, verbose=0)
pd.Series(history.history['loss']).plot(logy=True)
plt.xlabel("Epoch")
plt.ylabel("Training Error")

We now evaluate our model. First, we predict the rating for each user–item pair in the test set, and then we calculate the error.
results = model.evaluate([test.user_id, test.item_id], test.rating, batch_size=1)
We have some results from different settings. Remember that the errors are measured on the 1–5 rating scale.
#20 latent factors
20000/20000 [==============================] - 54s 3ms/sample - loss: 1.6322 - mae: 0.9582 - mse: 1.6322
#10 latent factors
20000/20000 [==============================] - 53s 3ms/sample - loss: 1.1858 - mae: 0.8259 - mse: 1.1858
#5 latent factors
20000/20000 [==============================] - 52s 3ms/sample - loss: 0.9430 - mae: 0.7500 - mse: 0.9430
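Since MSE is measured in squared stars, converting it to RMSE can make the numbers easier to read on the same 1–5 scale. A small sketch, assuming results holds [loss, mae, mse] in the order defined at compile time:

# e.g. with 5 latent factors, an MSE of about 0.94 corresponds to an RMSE of about 0.97 stars
rmse = np.sqrt(results[2])
print('Test RMSE: {:.4f}'.format(rmse))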
Learnt Embedding
We can now obtain the two learnt embedding matrices for users and items.
movie_embedding_learnt = model.get_layer(name='Movie-Embedding').get_weights()[0]
pd.DataFrame(movie_embedding_learnt).describe()
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| count | 1683.000000 | 1683.000000 | 1683.000000 | 1683.000000 | 1683.000000 |
| mean | 0.774399 | 0.679642 | -0.713351 | 0.731147 | 0.647028 |
| std | 0.504034 | 0.491500 | 0.561679 | 0.464591 | 0.519102 |
| min | -2.043083 | -0.980162 | -3.440306 | -1.761205 | -1.063968 |
| 25% | 0.441313 | 0.367185 | -1.112636 | 0.425561 | 0.278499 |
| 50% | 0.772326 | 0.683421 | -0.722607 | 0.723500 | 0.656169 |
| 75% | 1.096993 | 1.008840 | -0.337775 | 1.020044 | 1.019403 |
| max | 2.922819 | 2.663551 | 1.664768 | 2.312259 | 2.171595 |
user_embedding_learnt = model.get_layer(name='User-Embedding').get_weights()[0]
array([[ 0.178934 , 0.98884964, -1.4177339 , 0.50673306, 1.2531797 ],
[ 0.41552344, 0.9153664 , -1.280103 , 0.88151026, 1.0151937 ],
[ 0.11478277, 0.41585183, -0.57295203, 1.4692334 , 1.3177701 ],
...,
[ 1.1516297 , 1.072977 , -0.47597128, 1.1390864 , 1.0125358 ],
[-0.09381651, 1.7068275 , -0.5006427 , 1.7247322 , 0.05102845],
[ 0.02292876, -0.01486804, 0.02708695, 0.04261862, 0.02596695]],
dtype=float32)
How to Recommend?
Beginners may wonder why we are creating these matrices. What is the use of the embeddings we have spent so much time learning?
Recommending the top n items to a user is now simple. We take the embedding vector of the user, compute its dot product with every movie embedding vector, and keep the n largest values. The following code returns the ids of the top 5 most relevant movies.
def recommend(user_id, number_of_movies=5):
    # Predicted score of this user for every movie: dot products with all movie embeddings
    movies = user_embedding_learnt[user_id] @ movie_embedding_learnt.T
    # Indices of the number_of_movies largest scores (not sorted by score)
    mids = np.argpartition(movies, -number_of_movies)[-number_of_movies:]
    return mids
Now, we recommend 5 movie ids for user_id=1:
recommend(user_id=1)
#array([1466, 1305, 1388, 1535, 1448])
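One caveat: np.argpartition only guarantees that the returned ids have the largest scores; it does not order them. If you want the recommendations sorted from most to least relevant, a small variant (the function name is just for illustration) is:

def recommend_sorted(user_id, number_of_movies=5):
    # Predicted score of this user for every movie
    scores = user_embedding_learnt[user_id] @ movie_embedding_learnt.T
    # Take the top-n ids, then order them by descending predicted score
    top_ids = np.argpartition(scores, -number_of_movies)[-number_of_movies:]
    return top_ids[np.argsort(-scores[top_ids])]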
Conclusion
This post revisited a simple recommender system built with matrix factorization in Keras. Note, however, that the learnt embedding matrices contain some negative values. Some applications require the learnt embeddings to be non-negative; we will address that in another post.
Great article, very interesting. I do have a few questions though. First, why did you have to reindex the ids? What difference does it make to the results, and wouldn't that affect how the ids match up to the actual movies dataset? Also, why did you use the dot product for the matrix factorization rather than matrix multiplication? Thank you.
Q1: The reindexing makes the ids consistent with the Python indexing convention, which starts from 0 rather than 1, so it avoids possible indexing errors.
Q2: We need to compare a number to a number, i.e. a rating value (e.g. 5 stars) to a predicted value (e.g. 4.x) for each user–movie pair. The dot product of two vectors produces exactly such a number, whereas a full matrix multiplication does not serve that purpose.
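A concrete illustration with the embeddings learnt above (user 0 and movie 10 are arbitrary picks):

# The predicted rating of one user for one movie is just the dot product of
# their two embedding vectors: a single number comparable to a 1-5 star rating.
pred = np.dot(user_embedding_learnt[0], movie_embedding_learnt[10])
print(pred)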