A.I, Data and Software Engineering

Continue training big models on less powerful devices


It should come as no surprise if you don't have a powerful, expensive machine to train a complicated model. You may run out of memory during training at some epoch. This article demonstrates a simple workaround for this.

The problem

Training deep learning models requires a lot of computing power. Most laptops and desktops today can still train the models, but it can be very time-consuming. Unfortunately, for some models, your computer's configuration is simply not enough to get through the task.

A common problem you may encounter during training (e.g. on Google Colab) is "Your session crashed after using all available RAM". What would be an inexpensive solution?

The workaround

Please note that this solution can ONLY be applied to machines that can finish loading the data and complete at least one epoch of training.

Well, the idea is simple: save the training state to a file, release all memory, then reload the model and continue training from the next epoch.

Create a model and checkpoints

Suppose that we have a deep learning model:

def create_model():
    model = tf.keras.Sequential()
    ...
    model.add(tf.keras.layers.Flatten())
    ...
    # Compiling the model
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model
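The layers above are elided in the original. As a point of reference, here is a minimal runnable sketch; the input shape and layer sizes are illustrative assumptions, not the article's actual architecture:

```python
import tensorflow as tf

def create_model():
    # Minimal sketch: the input shape and layer sizes below are
    # illustrative assumptions, not the article's actual architecture.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    # Compiling the model
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model
```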

To save this model during training, we create a checkpoint callback using the following syntax:

tf.keras.callbacks.ModelCheckpoint(
    filepath, monitor='val_loss', verbose=0, save_best_only=False,
    save_weights_only=False, mode='auto', save_freq='epoch', **kwargs
)

As we want to save the entire state of the training model, we leave save_weights_only as False. Additionally, if we want to identify the saved model for each epoch, we can parameterize the file path with variables.

checkpoint_path = saved_model_dir + "/model-{epoch:02d}-{val_accuracy:.2f}.hdf5"
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 monitor='val_accuracy',
                                                 verbose=1,
                                                 save_best_only=True,
                                                 mode='max')
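The placeholders in the file path are filled with standard Python format syntax: at the end of each epoch, Keras substitutes the (1-based) epoch number and the value of the monitored metric. A quick sketch of the resulting filenames:

```python
# The same template as above, formatted the way Keras does internally.
template = "model-{epoch:02d}-{val_accuracy:.2f}.hdf5"
filename = template.format(epoch=3, val_accuracy=0.76328)
print(filename)  # model-03-0.76.hdf5
```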

We use the callback when fitting the model.

history = model.fit(train_generator, epochs=50, callbacks=[cp_callback,])

The model will train and save the best model so far, producing output like the following.

Epoch 00003: mlp_binary_accuracy improved from 0.76328 to 0.76360, saving model to saved_models//model-03-0.76.hdf5
500/500 - 26s - loss: 2.3871 -....
Epoch 4/50
Epoch 00004: mlp_binary_accuracy did not improve from 0.76360
500/500 - 26s - loss: ....val_accuracy: 0.7635
Epoch 5/50
Epoch 00005: mlp_binary_accuracy improved from 0.76360 to 0.76410, saving model to saved_models//model-05-0.76.hdf5
500/500 - 26s - loss: 2.3786 - use... val_accuracy: 0.7641
Epoch 6/50
...

Release memory and reload last saved model

If you use a Jupyter notebook, you can release the memory by running the following cell command, which terminates the kernel process (the notebook will restart it with a fresh, empty session).

import os
os._exit(00)  # exit immediately without cleanup, freeing all memory

To load the last saved model, we use tf.keras.models.load_model, as shown in the following code:

import glob
import os

saved_list = glob.glob(saved_model_dir + '/*')  # all files; use '*.hdf5' for a specific format
if load_saved_model and saved_list:
    last_saved = max(saved_list, key=os.path.getctime)  # most recently created checkpoint
    print("Load", last_saved)
    model = tf.keras.models.load_model(last_saved)
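To make the epoch counter (and the checkpoint filenames) continue from where training stopped, the epoch number can also be recovered from the checkpoint filename and passed to model.fit as initial_epoch. A sketch of such a helper, assuming the filename template used above:

```python
import glob
import os
import re

def latest_checkpoint(saved_model_dir):
    """Return (path, epoch) of the newest checkpoint, or (None, 0).

    Assumes the "model-{epoch:02d}-{val_accuracy:.2f}.hdf5" template
    used by the checkpoint callback above.
    """
    saved_list = glob.glob(os.path.join(saved_model_dir, '*.hdf5'))
    if not saved_list:
        return None, 0
    last_saved = max(saved_list, key=os.path.getctime)  # most recently created
    match = re.search(r'model-(\d+)-', os.path.basename(last_saved))
    epoch = int(match.group(1)) if match else 0
    return last_saved, epoch
```

After restarting the kernel, load the returned path with tf.keras.models.load_model and call model.fit(train_generator, epochs=50, initial_epoch=epoch, callbacks=[cp_callback]) so training resumes at the right epoch instead of starting over from epoch 1.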

Happy model saving, and happy saving the money you would spend upgrading your device! For more on saving and loading models, read this article.
