Sometimes we want to reduce training time by using a subset of a very large dataset in which negative samples outnumber positive ones, e.g. for word embeddings. Another situation is when we deal with implicit data; in that case, we may need to generate new data points for the negative class. This post demonstrates how to generate training data using uniform negative sampling.
The data
Originally, the rating matrix tells us who rated which items. Now we want data that tell us who interacted with which items. An interaction does not tell us whether the user liked or disliked the item. Doing so means that we transform a rating matrix (explicit data) into an implicit dataset.

If we give an interaction the value 1 and 0 otherwise, then the original rating data will become all 1s. So, you can see that with only 1s in the label, the model cannot distinguish between interact and not interact, as shown in the following tables.
| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 122 | 5.0 | 838985046 |
| 1 | 1 | 185 | 3 | 838983525 |
| 2 | 56 | 231 | 2 | 838983392 |
| 3 | 32 | 292 | 54 | 838983421 |
| 4 | 35 | 316 | 7 | 838983392 |
This is the interaction matrix:
| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 122 | 1 | 838985046 |
| 1 | 1 | 185 | 1 | 838983525 |
| 2 | 56 | 231 | 1 | 838983392 |
| 3 | 32 | 292 | 1 | 838983421 |
| 4 | 35 | 316 | 1 | 838983392 |
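As a minimal sketch with pandas (the rows are taken from the tables above; `ratings_df` and `interactions_df` are illustrative names), the explicit-to-implicit conversion simply replaces every observed rating with 1:

```python
import pandas as pd

# Hypothetical rating records shaped like the first table above
ratings_df = pd.DataFrame({
    "userId":    [1, 1, 56],
    "movieId":   [122, 185, 231],
    "rating":    [5.0, 3.0, 2.0],
    "timestamp": [838985046, 838983525, 838983392],
})

# Explicit -> implicit: every observed rating becomes an interaction of value 1
interactions_df = ratings_df.copy()
interactions_df["rating"] = 1
print(interactions_df)
```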
The sampling method
We want to get `n` negative samples per positive. The naive method (a rough sketch follows the list) can be:

- Loop through all `user id`s.
- For each `user id`, get a random `item id` and check that the pair `user-item` does not already exist in the dataset.
- Add the found `user-item` pair as a negative sample to the dataset.
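A rough sketch of this naive loop, assuming the positives live in a pandas DataFrame `ratings_df` and `all_movie_ids` is the list of all item ids (both names are illustrative, not part of the implementation below), could look like:

```python
import random

def naive_neg_sampling(ratings_df, all_movie_ids, n_neg=1):
    """Naive sketch: per user, keep drawing random items until enough
    unseen user-item pairs are found."""
    observed = set(zip(ratings_df.userId, ratings_df.movieId))
    negatives = []
    for user_id, group in ratings_df.groupby("userId"):
        needed = len(group) * n_neg              # n_neg negatives per positive
        while needed > 0:
            movie_id = random.choice(all_movie_ids)
            # membership check against every observed pair -- this retry loop is the slow part
            if (user_id, movie_id) not in observed:
                negatives.append((user_id, movie_id, 0))
                needed -= 1
    return negatives
```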
If you follow these steps, you may find that execution takes really long (~20 minutes). To speed things up, we use a few useful libraries as follows:
- Generate a dense matrix from the dataset using `scipy`; rows and columns are users and items.
- For each row, extract `k` random indices of zero values using `random.sample` (illustrated in the snippet below), where `k` is the number of non-zero values in that row (times `n_neg`).
- Build the list of `user-item` pairs from the `k` extracted indices and append them to the dataset.
Python Implementation
```python
import random
import time

import numpy as np
import pandas as pd
import scipy.sparse


def neg_sampling(ratings_df, n_neg=1, neg_val=0, pos_val=1, percent_print=5):
    """version 1.2: 1 positive 1 neg (2 times bigger than the original dataset by default)

    Parameters:
        ratings_df: input rating data as a pandas DataFrame: userId|movieId|rating
        n_neg: take n_neg negatives per 1 positive

    Returns:
        negative-sampled set as a pandas DataFrame
        userId|movieId|interact (implicit)
    """
    # re-encode ids as contiguous integer codes so they can index a matrix
    ratings_df.userId = ratings_df.userId.astype('category').cat.codes.values
    ratings_df.movieId = ratings_df.movieId.astype('category').cat.codes.values

    # users x items matrix: non-zero entries are observed interactions
    sparse_mat = scipy.sparse.coo_matrix((ratings_df.rating, (ratings_df.userId, ratings_df.movieId)))
    dense_mat = np.asarray(sparse_mat.todense())
    print(dense_mat.shape)

    # every observed pair becomes a positive sample
    nsamples = ratings_df[['userId', 'movieId']].copy()
    nsamples['interact'] = pos_val

    length = dense_mat.shape[0]
    printpc = int(length * percent_print / 100)

    nTempData = []
    i = 0
    start_time = time.time()
    stop_time = time.time()
    extra_samples = 0
    for row in dense_mat:
        if i % printpc == 0:
            stop_time = time.time()
            print("processed ... {0:0.2f}% ...{1:0.2f}secs".format(float(i) * 100 / length, stop_time - start_time))
            start_time = stop_time

        n_non_0 = len(np.nonzero(row)[0])
        zero_indices = np.where(row == 0)[0]
        if n_non_0 * n_neg + extra_samples > len(zero_indices):
            # not enough unseen items for this user: take them all and
            # carry the deficit over to the following users
            print(i, "non 0:", n_non_0, ": len ", len(zero_indices))
            neg_indices = zero_indices.tolist()
            extra_samples = n_non_0 * n_neg + extra_samples - len(zero_indices)
        else:
            # uniformly sample n_neg negatives per positive (plus any carried deficit)
            neg_indices = random.sample(zero_indices.tolist(), n_non_0 * n_neg + extra_samples)
            extra_samples = 0

        nTempData.extend([(uu, ii, rr) for (uu, ii, rr) in zip(np.repeat(i, len(neg_indices)),
                                                               neg_indices,
                                                               np.repeat(neg_val, len(neg_indices)))])
        i += 1

    nsamples = pd.concat([nsamples, pd.DataFrame(nTempData, columns=["userId", "movieId", "interact"])],
                         ignore_index=True)
    return nsamples
```
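A hypothetical call could look like the following; the CSV path is a placeholder and the columns are assumed to match the docstring:

```python
import pandas as pd

# Load explicit ratings (userId|movieId|rating|timestamp); the path is a placeholder
ratings_df = pd.read_csv("ratings.csv")

# One uniformly sampled negative per positive interaction
samples_df = neg_sampling(ratings_df, n_neg=1)
print(samples_df.shape)
print(samples_df.head())
```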
Result
(69878, 10677)
processed ... 0.00% ...0.00secs
processed ... 5.00% ...0.07secs
processed ... 10.00% ...0.07secs
processed ... 15.00% ...0.08secs
processed ... 20.00% ...0.08secs
processed ... 25.00% ...0.08secs
processed ... 30.00% ...0.08secs
processed ... 35.00% ...0.08secs
processed ... 40.00% ...0.07secs
processed ... 45.00% ...0.07secs
processed ... 50.00% ...0.08secs
processed ... 55.00% ...0.08secs
processed ... 60.00% ...0.08secs
processed ... 65.00% ...0.08secs
4168 non 0: 2314 : len 1392
processed ... 70.00% ...0.08secs
processed ... 75.00% ...0.09secs
processed ... 80.00% ...0.08secs
processed ... 85.00% ...0.08secs
processed ... 90.00% ...0.07secs
processed ... 95.00% ...0.07secs
done: (20000108, 3)
Wrapping up
Negative sampling is an efficient way to reduce training time on large, imbalanced datasets. The introduced method, i.e. `neg_sampling(…)`, samples negative values uniformly. The full set of ~20 million records (positives plus sampled negatives) is generated within ~2 seconds, roughly 600 times faster than the naive method.
[…] purposes, we use the dataset generated from negative samples using the technique mentioned in this post. The data contain user_id, item_id, and interaction (0 = no interaction, 1 = interaction). The […]