Previously, we trained our models on pre-generated datasets, for example in the recommender system or the recurrent neural network. In this article, we will demonstrate how to use a generator to produce data on the fly while training a model.
Keras Data Generator with Sequence
There are a couple of ways to create a data generator. However, TensorFlow Keras provides a base class for fitting a dataset as a sequence.
To create our own data generator, we need to subclass tf.keras.utils.Sequence and implement the __getitem__ and __len__ methods. A generator should return batches of (input, output) pairs for training; this is done in the __getitem__ method. The scaffold looks like this:
import math
from tensorflow.keras.utils import Sequence

class MyDataGenerator(Sequence):
    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
        # other initialization ...

    def __len__(self):
        # Number of batches per epoch
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        # generate one batch: X, y ...
        return X, y
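To make the scaffold concrete, here is a minimal runnable sketch using toy NumPy arrays (the class name and data are illustrative, not from the article). The try/except fallback just lets the snippet run even where TensorFlow is not installed:

```python
import math
import numpy as np

try:
    from tensorflow.keras.utils import Sequence
except ImportError:  # fallback so the sketch runs without TensorFlow
    Sequence = object

class ArrayDataGenerator(Sequence):
    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch (the last batch may be smaller)
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        # Slice one (inputs, targets) batch out of the arrays
        lo, hi = idx * self.batch_size, (idx + 1) * self.batch_size
        return self.x[lo:hi], self.y[lo:hi]

gen = ArrayDataGenerator(np.arange(10).reshape(10, 1),
                         np.arange(10), batch_size=4)
print(len(gen))         # 3 batches: 4 + 4 + 2
print(gen[2][0].shape)  # (2, 1) -- the smaller final batch
```

Note that slicing past the end of a NumPy array simply truncates, which is what makes the final, smaller batch work without special handling.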
If you want to modify your dataset between epochs, you can also implement on_epoch_end.
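A hypothetical sketch of that hook: keep an index array and reshuffle it whenever Keras invokes on_epoch_end at the end of an epoch (the class here is illustrative; in practice these lines live on your Sequence subclass):

```python
import numpy as np

class ShufflingExample:
    def __init__(self, n_samples):
        # Positional sample order used to assemble batches
        self.indexes = np.arange(n_samples)

    def on_epoch_end(self):
        # Reorder samples so the next epoch yields batches in a new order
        np.random.shuffle(self.indexes)

g = ShufflingExample(5)
g.on_epoch_end()
print(sorted(g.indexes))  # still [0, 1, 2, 3, 4], just permuted
```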
Train with a generator
After creating a generator, you have two options. One is to use the fit_generator method of the Keras model. For example:
model.fit_generator(generator=training_generator, validation_data=validation_generator, use_multiprocessing=True, workers=6)
Since that method is deprecated, we can instead pass the generator to the regular fit method, model.fit:
fit(x=None, y=None, batch_size=None, ...)
Remember that when x is a generator, we leave y untouched, since the targets are already included in the batches produced by the generator, as the following docstring shows:
x: Input data. It could be A generator or keras.utils.Sequence returning (inputs, targets) or (inputs, targets, sample weights). A more detailed description of unpacking behavior for iterator types (Dataset, generator, Sequence) is given below.
y: Target data. If x is a dataset, generator, or keras.utils.Sequence instance, y should not be specified (since targets will be obtained from x).
Python demonstration
Suppose we have a recommender model from this post. Now we create a data generator for training. We use the MovieLens dataset; you can refer to this post for downloading the data and parsing it into a Pandas DataFrame.
Below is the complete generator class.
import math
import numpy as np
from tensorflow.keras.utils import Sequence

class DataGenerator(Sequence):
    def __init__(self, dataset, batch_size=16, shuffle=True):
        'Initialization'
        self.batch_size = batch_size
        self.dataset = dataset
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return math.ceil(len(self.dataset) / self.batch_size)

    def __getitem__(self, index):
        'Generate one batch of data'
        # Positional indexes of the batch; slicing keeps the last,
        # possibly smaller, batch in range
        idxs = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Map positions back to the DataFrame labels
        list_IDs_temp = self.dataset.index[idxs]
        # Generate data
        User = self.dataset.loc[list_IDs_temp, ['user_id']].to_numpy()
        Item = self.dataset.loc[list_IDs_temp, ['item_id']].to_numpy()
        y = self.dataset.loc[list_IDs_temp, ['rating']].to_numpy()
        return [User, Item], [y]

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.dataset))
        if self.shuffle:
            np.random.shuffle(self.indexes)
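The subtlest part of the generator is mapping shuffled positional indexes back to DataFrame labels before calling .loc. A small self-contained illustration with made-up ratings (hypothetical values standing in for the MovieLens data):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the ratings DataFrame; note the non-contiguous
# index labels, as you would get after a train/test split
df = pd.DataFrame(
    {'user_id': [1, 1, 2, 2, 3],
     'item_id': [10, 20, 10, 30, 20],
     'rating':  [4.0, 3.0, 5.0, 2.0, 4.5]},
    index=[100, 101, 103, 107, 109])

batch_size, batch_no = 2, 1
indexes = np.arange(len(df))   # positions, reshuffled each epoch
pos = indexes[batch_no*batch_size:(batch_no+1)*batch_size]
labels = df.index[pos]         # translate positions into .loc labels
users = df.loc[labels, ['user_id']].to_numpy()
print(labels.tolist())  # [103, 107]
print(users.shape)      # (2, 1)
```

Indexing by position directly with .loc would fail here, because the labels 0..4 do not exist in this frame; the explicit position-to-label step avoids that.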
Some important notes:
- The recommender model takes 2 inputs and produces 1 output. Therefore, when coding the __getitem__ method, you should return batches that also contain 2 inputs and 1 output.
- Do not mistakenly use tf.math.ceil in the __len__ method, as it is different from math.ceil: it returns a tensor rather than the plain Python integer that __len__ must return.
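A quick check of that batch-count arithmetic, assuming an 80,000-row training split (a hypothetical figure, but consistent with the 5000 steps x batch size 16 in the training log below):

```python
import math

n_samples, batch_size = 80000, 16   # assumed training-split size
n_batches = math.ceil(n_samples / batch_size)
print(n_batches)        # 5000, matching "Train for 5000 steps"
print(type(n_batches))  # <class 'int'>, as __len__ requires
```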
Finally, the training is straightforward.
#create an instance of the generator with proper dataset
train_generator = DataGenerator(dataset=train)
#train
history = model.fit(train_generator, epochs=50)
Train for 5000 steps
Epoch 1/50
5000/5000 [==============================] - 37s 7ms/step - loss: 9.4502 - mae: 2.7821 - mse: 9.4491
Epoch 2/50
5000/5000 [==============================] - 36s 7ms/step - loss: 2.0485 - mae: 1.1212 - mse: 2.0442
Epoch 3/50
5000/5000 [=======================>......] - 36s 7ms/step - loss: 1.1734 - mae: 0.8398 - mse: 1.1675
Epoch 4/50
Conclusions
While Keras provides data generators, they also have limitations. One of the reasons is that every task needs a different data loader. Sometimes every image has one mask and sometimes several; sometimes the mask is saved as an image and sometimes it is encoded, etc. For every task we will probably need to tweak our data generator, but the structure will stay the same.