A.I, Data and Software Engineering

MovieLens automation: process and export

MovieLens is a popular collection of datasets for recommender systems. This post introduces a Python script that processes the MovieLens datasets, generates negative samples, and exports the data in SVMlight format. This format is also known as the libfm format and is used by many factorization machine implementations.
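To make the target format concrete, here is a minimal sketch of how one user-item interaction could be written as a libfm/SVMlight line (`label index:value index:value ...`). The function name `to_libfm_line` and the one-hot layout (user indices first, item indices shifted by the number of users) are assumptions for illustration, not necessarily what the script does. The indices here are 0-based; some SVMlight tools expect them to start at 1.

```python
def to_libfm_line(label, user_idx, item_idx, n_users):
    """Encode one (user, item, label) interaction as a libfm/SVMlight line.

    Hypothetical layout: users occupy feature indices [0, n_users),
    items are shifted by n_users so the two one-hot blocks do not collide.
    """
    return f"{label} {user_idx}:1 {n_users + item_idx}:1"


# User 0 interacted with item 2, in a dataset with 3 users.
line = to_libfm_line(1, user_idx=0, item_idx=2, n_users=3)
print(line)  # 1 0:1 5:1
```

Each line carries only the non-zero features, which keeps the exported files small even when there are many users and items.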

MovieLens datasets

MovieLens is a web-based recommender system and virtual community that recommends movies for its users to watch, based on their film preferences using collaborative filtering of members’ movie ratings and movie reviews. It contains about 11 million ratings for about 8500 movies. MovieLens was created in 1997 by GroupLens Research, a research lab in the Department of Computer Science and Engineering at the University of Minnesota, in order to gather research data on personalized recommendations.

GroupLens Research has made rating datasets from the MovieLens website (http://movielens.org) available. The team has collected datasets of different sizes over various periods of time. Currently, there are about 10 datasets available for download. The sets differ slightly in their formats, e.g. in the delimiters they use.
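As an illustration of the delimiter differences, the sketch below parses rating lines in three styles: tab-separated (as in ml-100k), `::`-separated (as in ml-1m), and comma-separated with a header row (as in the newer CSV releases). The helper `parse_ratings` is a hypothetical name, and the sample strings are only illustrative lines in each style, not a full dataset.

```python
def parse_ratings(text, delimiter, has_header=False):
    """Return a list of (user_id, item_id, rating, timestamp) tuples."""
    lines = text.strip().splitlines()
    if has_header:
        lines = lines[1:]  # skip the column names
    rows = []
    for line in lines:
        user, item, rating, ts = line.split(delimiter)
        rows.append((int(user), int(item), float(rating), int(ts)))
    return rows


# Illustrative sample lines in each style.
ml_100k = "196\t242\t3\t881250949\n186\t302\t3\t891717742"
ml_1m = "1::1193::5::978300760\n1::661::3::978302109"
ml_csv = "userId,movieId,rating,timestamp\n1,1,4.0,964982703"

ratings_100k = parse_ratings(ml_100k, "\t")
ratings_1m = parse_ratings(ml_1m, "::")
ratings_csv = parse_ratings(ml_csv, ",", has_header=True)
print(ratings_100k[0], ratings_1m[0], ratings_csv[0])
```

Passing the delimiter and header flag as parameters keeps one code path for all dataset versions.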

Process and export

In this section, we first download a dataset and compute some statistics. We then create negative samples from the dataset, which is useful for learning from implicit data. After that, we split the data into train, test, and validation sets and export them to different formats.
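The negative sampling and splitting steps can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the notebook's code: the function names, the uniform sampling of unseen items, the one-negative-per-positive default, and the 80/10/10 split ratios are all chosen here for illustration.

```python
import random


def sample_negatives(positives, n_items, n_neg_per_pos=1, seed=0):
    """For each observed (user, item) pair, draw items the user has not seen.

    Returns (user, item, 0) triples. The same negative may be drawn more
    than once for a user; deduplicate afterwards if that matters.
    """
    rng = random.Random(seed)
    seen = {}
    for user, item in positives:
        seen.setdefault(user, set()).add(item)

    negatives = []
    for user, _ in positives:
        user_seen = seen[user]
        if len(user_seen) >= n_items:
            continue  # this user has seen every item, nothing to sample
        for _ in range(n_neg_per_pos):
            item = rng.randrange(n_items)
            while item in user_seen:
                item = rng.randrange(n_items)
            negatives.append((user, item, 0))
    return negatives


def split(rows, train=0.8, valid=0.1, seed=0):
    """Shuffle rows and cut them into train, validation, and test lists."""
    rng = random.Random(seed)
    rows = list(rows)
    rng.shuffle(rows)
    n_train = int(len(rows) * train)
    n_valid = int(len(rows) * valid)
    return (rows[:n_train],
            rows[n_train:n_train + n_valid],
            rows[n_train + n_valid:])


# Toy data: observed (user, item) pairs, with 5 items in total.
positives = [(0, 0), (0, 1), (1, 2), (2, 0), (2, 3)]
negatives = sample_negatives(positives, n_items=5)
data = [(u, i, 1) for u, i in positives] + negatives
train_rows, valid_rows, test_rows = split(data)
print(len(train_rows), len(valid_rows), len(test_rows))
```

Observed interactions get label 1 and sampled ones get label 0, so the result can be used as a binary classification dataset for implicit feedback.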

I created a Jupyter notebook as follows:

Conclusion

If you are starting research on recommender systems, the MovieLens datasets are a good choice. They contain mainly explicit data, such as ratings. Nevertheless, we can transform some of them into implicit data, such as tag data. I hope you find this MovieLens automation useful. 🙂
