A.I, Data and Software Engineering

Generate data on the fly – Keras data generator


Previously, we trained our models on pre-generated datasets, for example, for the recommender system or the recurrent neural network. In this article, we demonstrate how to use a generator to produce data on the fly while training a model.

Keras Data Generator with Sequence

There are a couple of ways to create a data generator. One of them is to use the base class that TensorFlow Keras provides for fitting a dataset as a sequence.

To create our own data generator, we subclass tf.keras.utils.Sequence and implement the __getitem__ and __len__ methods.

A generator should return a batch of (input, output) pairs for training. This is done in the __getitem__ method. The scaffold looks like this.

import math
from tensorflow.keras.utils import Sequence

class MyDataGenerator(Sequence):
    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
        # other code ...
    def __len__(self):
        # Number of batches per epoch
        return math.ceil(len(self.x) / self.batch_size)
    def __getitem__(self, idx):
        # generate X, y for batch number idx ...
        return X, y

If you want to modify your dataset between epochs, you may also implement on_epoch_end.
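As a minimal sketch of how the scaffold could be filled in, the example below serves batches from in-memory NumPy arrays and reshuffles the row order in on_epoch_end. The class name ArrayBatchGenerator, the shuffle and seed parameters, and the toy data are assumptions made for illustration; the fallback to a plain object base class is only there so the sketch also runs where TensorFlow is not installed.

```python
import math
import numpy as np

try:
    from tensorflow.keras.utils import Sequence
except ImportError:  # fallback so the sketch runs without TensorFlow installed
    Sequence = object


class ArrayBatchGenerator(Sequence):
    """Hypothetical generator that serves (X, y) batches from NumPy arrays."""

    def __init__(self, x_set, y_set, batch_size, shuffle=True, seed=0):
        super().__init__()
        self.x, self.y = np.asarray(x_set), np.asarray(y_set)
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = np.random.default_rng(seed)
        # Row order used to build batches; starts unshuffled
        self.order = np.arange(len(self.x))

    def __len__(self):
        # Number of batches per epoch, counting a final partial batch
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        # Slicing never runs past the end, so the last batch may be shorter
        rows = self.order[idx * self.batch_size:(idx + 1) * self.batch_size]
        return self.x[rows], self.y[rows]

    def on_epoch_end(self):
        # Reshuffle the row order in place between epochs
        if self.shuffle:
            self.rng.shuffle(self.order)


# Toy data: 10 samples with one feature each
gen = ArrayBatchGenerator(np.arange(10).reshape(10, 1), np.arange(10), batch_size=4)
first_x, first_y = gen[0]
last_x, last_y = gen[len(gen) - 1]
```

With 10 samples and a batch size of 4, the generator reports three batches, and the last one holds the two remaining samples.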

Train with a generator

After creating a generator, you have two options. One is to use the fit_generator method of the Keras model. As that method is deprecated, the better option is to pass the generator to model.fit, which accepts it as the x argument.

fit(x=None, y=None, batch_size=None, epochs=1, ...)

Remember that when x is a generator, we leave y unset, because the targets are included in each batch produced by the generator, as the following docstring shows.

x: Input data. It could be a generator or keras.utils.Sequence returning (inputs, targets) or (inputs, targets, sample weights). A more detailed description of unpacking behavior for iterator types (Dataset, generator, Sequence) is given below.
y: Target data. If x is a dataset, generator, or keras.utils.Sequence instance, y should not be specified (since targets will be obtained from x).
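To make the (inputs, targets) contract from the docstring concrete, here is a small sketch that uses a plain Python generator instead of a Sequence. The function name batch_generator and the toy arrays are assumptions for illustration, and the fit call is left as a comment because it needs a compiled model.

```python
import numpy as np


def batch_generator(inputs, targets, batch_size):
    """Yield (inputs, targets) batches forever, one tuple per training step."""
    n = len(inputs)
    while True:
        for start in range(0, n, batch_size):
            yield inputs[start:start + batch_size], targets[start:start + batch_size]


# Hypothetical toy data: 6 samples with 2 features each
x = np.arange(12, dtype="float32").reshape(6, 2)
y = np.arange(6, dtype="float32")

gen = batch_generator(x, y, batch_size=4)
batch_x, batch_y = next(gen)  # each item already carries its targets

# With a compiled Keras model, y is left out because the targets come from x.
# A plain generator has no length, so steps_per_epoch must be given:
# model.fit(gen, steps_per_epoch=2, epochs=3)
```

Unlike a Sequence, a plain generator cannot report how many batches make up an epoch, which is why the commented call passes steps_per_epoch explicitly.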

Python demonstration

Suppose that we have a recommender model from this post. Now we create a data generator for training it. We use the MovieLens dataset; you can refer to this post for downloading and parsing the data into a pandas DataFrame.

Below is the complete generator class.

import math
import numpy as np
from tensorflow.keras.utils import Sequence

class DataGenerator(Sequence):
    def __init__(self, dataset, batch_size=16, dim=(1), shuffle=True):
        self.dim = dim
        self.batch_size = batch_size
        self.dataset = dataset
        self.shuffle = shuffle
        # Positional row indexes; reshuffled after each epoch when shuffle is set
        self.indexes = np.arange(len(dataset))
    def __len__(self):
        'Denotes the number of batches per epoch'
        return math.ceil(len(self.dataset) / self.batch_size)
    def __getitem__(self, index):
        'Generate one batch of data'
        # Row positions of this batch; slicing keeps the last, shorter batch in range
        batch_rows = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        batch = self.dataset.iloc[batch_rows]
        # Generate data: two inputs (user, item) and one output (rating)
        User = batch[['user_id']].to_numpy()
        Item = batch[['item_id']].to_numpy()
        y = batch[['rating']].to_numpy()
        return [User, Item], [y]
    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.dataset))
        if self.shuffle:
            np.random.shuffle(self.indexes)

Some important notes:

  • The recommender model takes 2 inputs and produces 1 output. Therefore, when coding the __getitem__ method, you should return a batch that also contains 2 inputs and 1 output.
  • Do not mistakenly use tf.math.ceil in the __len__ method. It returns a tensor, whereas __len__ must return a plain Python integer, which is what math.ceil gives.
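The second note can be checked with a few lines of plain Python. This sketch uses hypothetical sizes (10 samples, batches of 4) to show why the batch count is rounded up and that math.ceil yields an ordinary int, which is what __len__ has to return.

```python
import math

n_samples, batch_size = 10, 4  # hypothetical sizes

full_batches = n_samples // batch_size           # floor drops the partial batch
all_batches = math.ceil(n_samples / batch_size)  # ceil keeps it

# Size of every batch when the last one takes whatever is left over
batch_sizes = [
    min(batch_size, n_samples - i * batch_size) for i in range(all_batches)
]

# math.ceil returns a built-in int, so len() accepts it directly
is_plain_int = isinstance(all_batches, int)
```

Rounding down would silently skip the last two samples in every epoch, which is why the generator counts the partial batch too.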

Finally, the training is straightforward.

#create an instance of the generator with proper dataset
train_generator = DataGenerator(dataset=train)
history = model.fit(train_generator, epochs=50)
Train for 5000 steps
Epoch 1/50
5000/5000 [==============================] - 37s 7ms/step - loss: 9.4502 - mae: 2.7821 - mse: 9.4491
Epoch 2/50
5000/5000 [==============================] - 36s 7ms/step - loss: 2.0485 - mae: 1.1212 - mse: 2.0442
Epoch 3/50
5000/5000 [=======================>......] - 36s 7ms/step - loss: 1.1734 - mae: 0.8398 - mse: 1.1675
Epoch 4/50


While Keras provides ready-made data generators, they have limitations, mainly because every task needs a different data loader. Sometimes every image has one mask and sometimes several; sometimes the mask is saved as an image and sometimes it is encoded, and so on. For every task, we will probably need to tweak our data generator, but the structure will stay the same.

