MovieLens is a popular collection of datasets for recommender systems. This post introduces a Python script that processes the MovieLens datasets, generates negative samples, and transforms the datasets into SVMlight format. This format is also known as the libFM format and is used by many factorization machine implementations.
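To make the target format concrete, here is a minimal sketch of how one rating could be written as a line of the form `label index:value index:value`. The helper name `to_libfm_line`, the toy ratings, and the choice of one-based, one-hot user and item features are my assumptions for illustration, not necessarily the layout the script itself uses.

```python
def to_libfm_line(rating, user_id, item_id, n_users):
    """Encode one rating as a sparse 'label index:value ...' line.

    Assumption: users and items are one-hot encoded in a single feature
    space, users first, items after them, with one-based feature indices.
    """
    user_feature = user_id + 1              # users occupy indices 1..n_users
    item_feature = n_users + item_id + 1    # items start right after the users
    return f"{rating} {user_feature}:1 {item_feature}:1"


# Hypothetical toy data: (user_id, item_id, rating) with zero-based ids.
ratings = [(0, 0, 5), (0, 2, 3), (1, 1, 4)]
n_users = 2

lines = [to_libfm_line(r, u, i, n_users) for u, i, r in ratings]
print("\n".join(lines))
```

Each line holds the label followed by only the non-zero features, which keeps the files small even when there are many users and movies.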
MovieLens datasets
MovieLens is a web-based recommender system and virtual community that recommends movies for its users to watch, based on their film preferences using collaborative filtering of members’ movie ratings and movie reviews. It contains about 11 million ratings for about 8500 movies. MovieLens was created in 1997 by GroupLens Research, a research lab in the Department of Computer Science and Engineering at the University of Minnesota, in order to gather research data on personalized recommendations.
GroupLens Research has made rating datasets from the MovieLens website (http://movielens.org) available for download. The team has collected these datasets over various periods of time, so they come in different sizes. Currently, about 10 datasets are available. Their formats differ slightly, for example in the delimiter used to separate fields.
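The delimiter differences can be handled with a small lookup table. The sketch below assumes the rating files of three releases (`ml-100k`, `ml-1m`, `ml-20m`) use a tab, `::`, and a comma respectively, and that each rating line has four fields; the sample lines and the helper `parse_rating_line` are illustrative only.

```python
def parse_rating_line(line, delimiter):
    """Split one raw rating line into (user_id, movie_id, rating, timestamp)."""
    user, movie, rating, timestamp = line.strip().split(delimiter)
    return int(user), int(movie), float(rating), int(timestamp)


# Assumed field delimiter of the rating file in each release.
DELIMITERS = {"ml-100k": "\t", "ml-1m": "::", "ml-20m": ","}

# Illustrative sample lines in each layout. The CSV releases also start
# with a header row, which has to be skipped before parsing.
samples = {
    "ml-100k": "196\t242\t3\t881250949",
    "ml-1m": "1::1193::5::978300760",
    "ml-20m": "1,2,3.5,1112486027",
}

parsed = {
    name: parse_rating_line(line, DELIMITERS[name])
    for name, line in samples.items()
}
for name, row in parsed.items():
    print(name, row)
```

Once every release is parsed into the same tuple shape, the rest of the processing does not need to know which file the data came from.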
Process and export
In this section, we first download a dataset and compute some statistics. We then create negative samples for the dataset, which is useful for learning from implicit data. After that, we split the data into train, validation, and test sets and export them in different formats.
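The negative sampling and splitting steps can be sketched as follows. This is one simple approach under my own assumptions: every observed (user, item) pair counts as a positive, negatives are drawn uniformly from the items a user has not interacted with, and the split is a plain random shuffle. The function names, the toy data, and the 80/10/10 ratio are hypothetical.

```python
import random


def sample_negatives(positives, n_items, n_neg_per_pos, rng):
    """Draw unobserved items per user, without replacement."""
    seen = {}
    for user, item in positives:
        seen.setdefault(user, set()).add(item)

    negatives = []
    for user in sorted(seen):
        candidates = [i for i in range(n_items) if i not in seen[user]]
        # Never ask for more negatives than there are unseen items.
        k = min(len(candidates), n_neg_per_pos * len(seen[user]))
        negatives.extend((user, item) for item in rng.sample(candidates, k))
    return negatives


def split_rows(rows, train_frac, valid_frac, rng):
    """Shuffle the rows and cut them into train, validation, and test parts."""
    rows = list(rows)
    rng.shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_valid = int(len(rows) * valid_frac)
    return (rows[:n_train],
            rows[n_train:n_train + n_valid],
            rows[n_train + n_valid:])


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical toy interactions: (user_id, item_id), with 5 items in total.
positives = [(0, 0), (0, 1), (1, 2), (2, 0), (2, 3)]
negatives = sample_negatives(positives, n_items=5, n_neg_per_pos=1, rng=rng)

# Label positives with 1 and sampled negatives with 0.
labeled = [(u, i, 1) for u, i in positives] + [(u, i, 0) for u, i in negatives]
train, valid, test = split_rows(labeled, train_frac=0.8, valid_frac=0.1, rng=rng)
print(len(train), len(valid), len(test))
```

Sampling without replacement from the list of unseen items guarantees that the loop terminates even for users who have rated almost every movie, at the cost of building that list per user.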
I created a Jupyter notebook that carries out these steps.
If you are starting research on recommender systems, the MovieLens datasets are a good choice. They contain mainly explicit data, such as ratings. Nevertheless, we can transform some of them into implicit data, such as tag data. I hope you find this MovieLens automation useful. 🙂