Previously, we trained our models using pre-generated datasets, for example in the recommender system and recurrent neural network posts. In this article, we will demonstrate using a generator to produce data on the fly for training a model.
Keras Data Generator with Sequence
There are a couple of ways to create a data generator. However, TensorFlow Keras provides a base class for fitting a dataset as a sequence.
To create our own data generator, we need to subclass `tf.keras.utils.Sequence` and implement the `__getitem__` and `__len__` methods.
A generator should return a batch of `(input, output)` pairs for training. This is achieved by implementing the `__getitem__` method. The scaffold looks like this:
```python
import math

from tensorflow.keras.utils import Sequence

class MyDataGenerator(Sequence):
    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
        # other initialization code ...

    def __len__(self):
        # number of batches per epoch
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        # generate X, y for the batch at position idx ...
        return X, y
```
If you want to modify your dataset between epochs, you may implement `on_epoch_end`.
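As a minimal sketch (assuming the samples are held in NumPy arrays, as in the scaffold above), shuffling the data at the end of every epoch could look like this:

```python
import numpy as np

class MyDataGenerator(Sequence):
    # ... __init__, __len__, __getitem__ as in the scaffold above ...

    def on_epoch_end(self):
        # Called by Keras at the end of every epoch; reshuffle the
        # sample order so that batches differ between epochs.
        perm = np.random.permutation(len(self.x))
        self.x, self.y = self.x[perm], self.y[perm]
```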
Train with a generator
After creating a generator, you have two options. One is to use the `fit_generator` method of the Keras model. For example:
```python
model.fit_generator(generator=training_generator,
                    validation_data=validation_generator,
                    use_multiprocessing=True,
                    workers=6)
```
As `fit_generator` is deprecated, we can pass the generator directly to the regular `fit` method of the model.
```python
fit(x=None, y=None, batch_size=None, ...)
```
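The earlier `fit_generator` call can therefore be rewritten as follows (a sketch assuming the same `training_generator` and `validation_generator` instances as above):

```python
model.fit(training_generator,
          validation_data=validation_generator,
          use_multiprocessing=True,
          workers=6)
```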
Remember that when `x` is a generator, we leave `y` untouched, since the targets are already included in each batch produced by the generator, as shown in the following docstring:
```
x: Input data. It could be
   A generator or keras.utils.Sequence returning (inputs, targets)
   or (inputs, targets, sample weights).
   A more detailed description of unpacking behavior for iterator
   types (Dataset, generator, Sequence) is given below.
y: Target data. If x is a dataset, generator, or
   keras.utils.Sequence instance, y should not be specified (since
   targets will be obtained from x).
```
Python demonstration
Suppose that we have a recommender model from this post. Now we create a data generator for training. We use the MovieLens dataset; you can refer to this post for downloading and parsing the data into a pandas DataFrame.
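For reference, loading the ratings into a DataFrame might look like the sketch below. The file name `u.data` and the tab-separated layout are assumptions based on the standard MovieLens 100K release; the column names match those used by the generator that follows.

```python
import pandas as pd

# MovieLens 100K ratings file: tab-separated user_id, item_id, rating, timestamp
dataset = pd.read_csv('u.data', sep='\t',
                      names=['user_id', 'item_id', 'rating', 'timestamp'])
train = dataset[['user_id', 'item_id', 'rating']]
```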
Below is the complete generator class.
```python
import math

import numpy as np
from tensorflow.keras.utils import Sequence

class DataGenerator(Sequence):
    def __init__(self, dataset, batch_size=16, dim=(1,), shuffle=True):
        'Initialization'
        self.dim = dim
        self.batch_size = batch_size
        self.dataset = dataset
        self.shuffle = shuffle
        self.indexes = dataset.index
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return math.ceil(len(self.dataset) / self.batch_size)

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch, clipping the last batch
        # to the dataset size
        start = index * self.batch_size
        end = min((index + 1) * self.batch_size, len(self.dataset))

        # Find list of IDs
        list_IDs_temp = [self.indexes[k] for k in range(start, end)]

        # Generate data
        User = self.dataset.loc[list_IDs_temp, ['user_id']].to_numpy()
        Item = self.dataset.loc[list_IDs_temp, ['item_id']].to_numpy()
        y = self.dataset.loc[list_IDs_temp, ['rating']].to_numpy()

        return [User, Item], [y]

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        # assumes the DataFrame has a default RangeIndex (0..n-1)
        self.indexes = np.arange(len(self.dataset))
        if self.shuffle:
            np.random.shuffle(self.indexes)
```
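A quick way to sanity-check the generator before training (a sketch, assuming `train` is the DataFrame from above) is to pull the first batch and inspect its shapes:

```python
gen = DataGenerator(dataset=train, batch_size=16)
(users, items), (ratings,) = gen[0]

print(len(gen))                                  # number of batches per epoch
print(users.shape, items.shape, ratings.shape)   # (16, 1) for each array
```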
Some important notes:

- The recommender model takes 2 inputs and produces 1 output. Therefore, the `__getitem__` method should return batches that likewise contain 2 inputs and 1 output.
- Do not mistakenly use `tf.math.ceil` in the `__len__` method, as it is different from `math.ceil`: `tf.math.ceil` returns a float tensor, whereas `__len__` must return a plain Python integer. See the comparison after this list.
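To make the difference concrete, here is a minimal sketch that only illustrates the return types:

```python
import math
import tensorflow as tf

print(math.ceil(100 / 16))     # 7 -- a Python int, valid as a __len__ result
print(tf.math.ceil(100 / 16))  # tf.Tensor(7.0, ...), a float32 tensor, not an int
```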
Finally, the training is straightforward.
```python
# create an instance of the generator with the proper dataset
train_generator = DataGenerator(dataset=train)

# train
history = model.fit(train_generator, epochs=50)
```
```
Train for 5000 steps
Epoch 1/50
5000/5000 [==============================] - 37s 7ms/step - loss: 9.4502 - mae: 2.7821 - mse: 9.4491
Epoch 2/50
5000/5000 [==============================] - 36s 7ms/step - loss: 2.0485 - mae: 1.1212 - mse: 2.0442
Epoch 3/50
5000/5000 [==============================] - 36s 7ms/step - loss: 1.1734 - mae: 0.8398 - mse: 1.1675
Epoch 4/50
```
Conclusions
While Keras provides data generators, they also have limitations. One of the reasons is that every task needs a different data loader. Sometimes every image has one mask and sometimes several; sometimes the mask is saved as an image and sometimes it is encoded, etc. For every task we will probably need to tweak our data generator, but the structure will stay the same.