Table of contents
Previously, we trained our models using pre-generated datasets, for example in the recommender system or the recurrent neural network. In this article, we will demonstrate how to use a generator to produce data on the fly for training a model.
Keras Data Generator with Sequence
There are a couple of ways to create a data generator. However, TensorFlow Keras provides a base class for fitting a dataset as a sequence.
To create our own data generator, we need to subclass
tf.keras.utils.Sequence and must implement the
__getitem__ and the
__len__ methods.
A generator should return a batch of (input, output) pairs for training. This is achieved by implementing the
__getitem__ method. The scaffold would look like this.
import math
from tensorflow.keras.utils import Sequence

class DataGenerator(Sequence):
    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
        # other code ...

    def __len__(self):
        # number of batches per epoch
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        # generate X, y ...
        return X, y
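To see how the batch arithmetic in __len__ and __getitem__ plays out, here is a minimal, dependency-free sketch. ToyBatcher is a hypothetical plain-Python stand-in (it does not subclass Sequence) using toy list data, just to illustrate the rounding-up and last-short-batch behavior:

```python
import math

class ToyBatcher:
    """Illustrative stand-in for a Sequence subclass (not a real Keras class)."""
    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch; math.ceil ensures no sample is dropped
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        lo = idx * self.batch_size
        hi = min(lo + self.batch_size, len(self.x))  # clamp the last batch
        return self.x[lo:hi], self.y[lo:hi]

batcher = ToyBatcher(list(range(10)), list(range(10, 20)), batch_size=4)
print(len(batcher))  # 3 batches: 4 + 4 + 2 samples
print(batcher[2])    # ([8, 9], [18, 19]) -- the final, shorter batch
```

Note that 10 samples with batch size 4 give ceil(10 / 4) = 3 batches, which is why the last batch is clamped rather than read past the end of the data.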
If you want to modify your dataset between epochs, you may implement the
on_epoch_end method.
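The epoch-end hook can be sketched in isolation. ShufflingBatcher below is a hypothetical, dependency-free stand-in (not a real Sequence subclass) that only shows the reshuffling idea:

```python
import random

class ShufflingBatcher:
    """Illustrative only: the epoch-end hook, detached from a full generator."""
    def __init__(self, n_samples, seed=42):
        self.indexes = list(range(n_samples))
        self._rng = random.Random(seed)

    def on_epoch_end(self):
        # Keras calls this hook after every epoch on a Sequence subclass;
        # reshuffling here changes which samples land in which batch.
        self._rng.shuffle(self.indexes)

gen = ShufflingBatcher(6)
order_epoch1 = list(gen.indexes)
gen.on_epoch_end()
order_epoch2 = list(gen.indexes)
# the same samples are visited, just in a different order
assert sorted(order_epoch2) == order_epoch1
```

Shuffling the index list rather than the data itself is cheap and leaves the underlying dataset untouched.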
Train with a generator
After creating a generator, you have two options. One is to use
fit_generator. As that method is deprecated, we can use the familiar
fit method instead:
fit(x=None, y=None, batch_size=None, ...)
Remember that when
x is a generator, we leave
y untouched, since the output is already included in each batch produced by the generator, as shown in the following docstring.
x: Input data. It could be A generator or keras.utils.Sequence returning (inputs, targets) or (inputs, targets, sample weights). A more detailed description of unpacking behavior for iterator types (Dataset, generator, Sequence) is given below.
y: Target data. If x is a dataset, generator, or keras.utils.Sequence instance, y should not be specified (since targets will be obtained from x).
Suppose that we have a recommender model from this post. Now we create a data generator for training it. We use the MovieLens dataset; you can refer to this post for downloading and parsing the data into a pandas DataFrame.
Below is the complete generator class.
import math
import numpy as np
from tensorflow.keras.utils import Sequence

class DataGenerator(Sequence):
    def __init__(self, dataset, batch_size=16, dim=(1,), shuffle=True):
        self.dim = dim
        self.batch_size = batch_size
        self.dataset = dataset
        self.shuffle = shuffle
        self.indexes = np.array(dataset.index)

    def __len__(self):
        'Denotes the number of batches per epoch'
        return math.ceil(len(self.dataset) / self.batch_size)

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate positions of the batch (clamped, so the last batch may be shorter)
        idxs = range(index * self.batch_size,
                     min((index + 1) * self.batch_size, len(self.dataset)))
        # Find list of IDs
        list_IDs_temp = [self.indexes[k] for k in idxs]
        # Generate data
        User = self.dataset.loc[list_IDs_temp, ['user_id']].to_numpy()
        Item = self.dataset.loc[list_IDs_temp, ['item_id']].to_numpy()
        y = self.dataset.loc[list_IDs_temp, ['rating']].to_numpy()
        return [User, Item], [y]

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        # reshuffle the stored row labels so batches differ between epochs
        if self.shuffle:
            np.random.shuffle(self.indexes)
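Since the model takes two inputs, each batch pairs a list of two input arrays with the target. The same indexing logic can be sketched without pandas or TensorFlow; ratings and get_batch below are hypothetical stand-ins (plain dicts and lists in place of the DataFrame) used only to show the [User, Item], [y] batch structure:

```python
import math

# toy stand-in for the MovieLens frame: parallel columns of equal length
ratings = {
    "user_id": [1, 1, 2, 2, 3],
    "item_id": [10, 20, 10, 30, 20],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5],
}

def get_batch(table, index, batch_size):
    lo = index * batch_size
    hi = min(lo + batch_size, len(table["rating"]))  # clamp the last batch
    users = table["user_id"][lo:hi]
    items = table["item_id"][lo:hi]
    y = table["rating"][lo:hi]
    # two inputs, one output -- matching the recommender's signature
    return [users, items], [y]

n_batches = math.ceil(len(ratings["rating"]) / 2)
print(n_batches)                  # 3
print(get_batch(ratings, 2, 2))   # ([[3], [20]], [[4.5]]) -- single-sample batch
```

With 5 rows and a batch size of 2, the third batch holds only the remaining row, which is exactly the case the min(...) clamp handles.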
Some important notes:
- The recommender model takes 2 inputs and produces 1 output. Therefore, when coding the
__getitem__ method, you should return batches that likewise contain 2 inputs and 1 output.
- Do not mistake the
__len__ method for the size of the dataset: it returns the number of batches per epoch, not the number of samples.
Finally, the training is straightforward.
# create an instance of the generator with the proper dataset
train_generator = DataGenerator(dataset=train)
history = model.fit(train_generator, epochs=50)
Train for 5000 steps
5000/5000 [==============================] - 37s 7ms/step - loss: 9.4502 - mae: 2.7821 - mse: 9.4491
5000/5000 [==============================] - 36s 7ms/step - loss: 2.0485 - mae: 1.1212 - mse: 2.0442
5000/5000 [=======================>......] - 36s 7ms/step - loss: 1.1734 - mae: 0.8398 - mse: 1.1675
While Keras provides data generators, they also have limitations. One reason is that every task needs a different data loader. Sometimes every image has one mask and sometimes several; sometimes the mask is saved as an image and sometimes it is encoded, etc. For every task we will probably need to tweak our data generator, but the structure will stay the same.