A.I, Data and Software Engineering

Fast uniform negative sampling for rating matrix


Sometimes we want to reduce training time by using a subset of a very large dataset in which the negative samples far outnumber the positive ones, e.g. word embedding. Another situation is implicit data, where only positive interactions are recorded and we may need to generate the negative examples ourselves. This post demonstrates how to generate training data using uniform negative sampling.

The data

Originally, the rating matrix tells us who rated which items. Now we want data that tells us who interacted with which items. An interaction does not say whether the user liked or disliked an item. In other words, we transform a rating matrix (explicit data) into an implicit dataset.

Figure: Rating matrix for collaborative filtering (source: Wikipedia)

If an interaction has value 1 and a non-interaction has value 0, then every record in the original rating data becomes 1. With only 1s as labels, the model cannot distinguish between interaction and non-interaction, as the following tables show.

   userId  movieId  rating  timestamp
0       1      122     5.0  838985046
1       1      185       3  838983525
2      56      231       2  838983392
3      32      292      54  838983421
4      35      316       7  838983392

This is the interaction matrix:

   userId  movieId  rating  timestamp
0       1      122       1  838985046
1       1      185       1  838983525
2      56      231       1  838983392
3      32      292       1  838983421
4      35      316       1  838983392
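
The conversion illustrated by the two tables above can be sketched with pandas. This is only a minimal sketch on toy data: the helper name `to_implicit` and the sample values are assumptions made for illustration and do not come from the original dataset.

```python
import pandas as pd

# Toy explicit ratings (illustrative values only, not the post's dataset).
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [122, 185, 122, 316],
    "rating": [5.0, 3.0, 2.0, 4.0],
    "timestamp": [838985046, 838983525, 838983392, 838983421],
})


def to_implicit(df):
    """Return a copy in which every observed rating becomes the label 1."""
    out = df.copy()
    out["rating"] = 1  # an observed rating only means "the user interacted"
    return out


interactions = to_implicit(ratings)
print(interactions)
```

After this step every label is 1, which is why negative (label 0) pairs have to be sampled before training.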

The sampling method

We want n negative samples for each positive one. A naive method is:

  1. Loop through all user ids
  2. For each user id, draw a random item id and check that the user-item pair does not exist in the dataset
  3. Add the user-item pair found to the dataset as a negative sample.
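
The naive steps above can be sketched in plain Python as follows. The function name `naive_negative_sampling`, its parameters and the `(user, item, label)` output layout are hypothetical choices for this sketch. It also assumes that `item_ids` lists every item that appears in the positive pairs.

```python
import random
from collections import defaultdict


def naive_negative_sampling(positives, item_ids, n_neg=1, seed=0):
    """Draw n_neg unseen items per positive (user, item) pair by rejection."""
    rng = random.Random(seed)
    items = sorted(set(item_ids))

    # Items each user has interacted with (or already received as a negative).
    seen = defaultdict(set)
    for user, item in positives:
        seen[user].add(item)

    negatives = []
    for user, _ in positives:
        for _ in range(n_neg):
            # Assumes every seen item is in `items`; stop when nothing is left.
            if len(seen[user]) >= len(items):
                break
            # Rejection loop: redraw until the pair is not in the dataset.
            while True:
                candidate = rng.choice(items)
                if candidate not in seen[user]:
                    break
            seen[user].add(candidate)  # prevents duplicate negatives
            negatives.append((user, candidate, 0))
    return negatives


positives = [(1, 122), (1, 185), (2, 122)]
print(naive_negative_sampling(positives, [122, 185, 231, 292, 316]))
```

The per-pair Python loop with repeated redraws is what makes this approach slow on large datasets.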

If you follow these steps, you may find that execution takes a really long time (~20 minutes). To speed things up, we use some helpful libraries as follows:

  1. Generate a dense matrix from the dataset using scipy,
    • rows are users and columns are items
  2. For each row, extract the indices of k random items with value 0 using random.sample
    • k is the number of non-zero values in that row.
  3. Build the list of user-item pairs from the k extracted indices and append it to the dataset.
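
The steps above can be sketched as follows. This is a minimal sketch rather than the post's own implementation: the function name `sample_negatives_uniform` and its parameters are hypothetical, user and item ids are assumed to be zero-based contiguous indices, and the dense matrix is built directly with NumPy instead of scipy to keep the example short.

```python
import random

import numpy as np


def sample_negatives_uniform(user_idx, item_idx, n_users, n_items, seed=0):
    """For each user, sample as many zero-valued items as the user has positives."""
    rng = random.Random(seed)

    # Step 1: dense interaction matrix, rows are users and columns are items.
    dense = np.zeros((n_users, n_items), dtype=np.int8)
    dense[user_idx, item_idx] = 1

    rows = []
    for u in range(n_users):
        # Step 2: column indices of the zero entries in this row.
        zeros = np.flatnonzero(dense[u] == 0)
        n_pos = n_items - zeros.size
        k = min(n_pos, zeros.size)  # cap k when few zero entries remain
        if k == 0:
            continue
        picked = rng.sample(zeros.tolist(), k)  # uniform, without replacement
        # Step 3: collect (user, item, label=0) triples.
        rows.extend((u, i, 0) for i in picked)

    return np.array(rows, dtype=np.int64).reshape(-1, 3)


users = np.array([0, 0, 1, 2])
items = np.array([0, 1, 0, 3])
print(sample_negatives_uniform(users, items, n_users=3, n_items=5))
```

Note that the dense matrix needs n_users × n_items bytes here, so for very large catalogues the rows would have to be processed in chunks or taken from a sparse representation.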

Python Implementation

Result

Wrapping up

Negative sampling is an efficient way to reduce the training time on a large, imbalanced dataset. The method introduced here, i.e. neg_sampling(…), samples negative values uniformly. 2 million rating records can be generated within ~2 seconds (600 times faster than the naive method).
