tensorflow, neural-network, ram, tensorflow-datasets

Tensorflow memory problems with validation data


I'm trying to train a TensorFlow neural network. Since the training data is too big for my computer's RAM, I have divided it into sub-datasets and train them sequentially. I have a problem, however: if I call model.fit with the validation_data parameter, the code returns an error. If I do not pass any validation data, it works perfectly.

epochs = 30
optimizer = tf.keras.optimizers.SGD(learning_rate=lr)
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
x_val, y_val, x_test, y_test = read_val(3, val_ratio=0.5)
print("start training")
for epoch in range(epochs):
    print("\n" + str(epoch))
    for n in range(3):
        x_t, y_t = read_small(n)
        history = model.fit(x_t, y_t, epochs=1, batch_size=32, verbose=0, validation_data=(x_val, y_val))
        x_t = []; y_t = [];

The read_small(n) function essentially reads the n-th sub-dataset of the training data. If I call the function as history = model.fit(x_t, y_t, epochs=1, batch_size=32, verbose=0), the full training completes for all 30 epochs and no error is returned.

However, if I call the function as given in the code, history = model.fit(x_t, y_t, epochs=1, batch_size=32, verbose=0, validation_data=(x_val, y_val)), the program returns the following error:

"C:\Users\PC\anaconda3\envs\CVISEnv\lib\site-packages\tensorflow\python\framework\constant_op.py", line 102, in convert_to_eager_tensor return ops.EagerTensor(value, ctx.device_name, dtype) numpy.core._exceptions._ArrayMemoryError: Unable to allocate 477. MiB for an array with shape (125000, 1000) and data type float32

But the error happens at around epoch 15, so it is not that the computer lacks the RAM to read the validation data. What could be the problem? Does model.fit make a copy of the validation data every time the function is called, even though I have already read it beforehand?


Solution

  • My suspicion is that x_t = []; y_t = [] is not enough to release the loaded training data, and that garbage accumulates over time, eventually filling the memory. A more definitive approach would be del x_t, y_t (see the sketch after the code below).

    In general, we can use the tf.data pipeline to manage datasets and avoid loading all the data at once; the introduction in the TensorFlow docs is a good starting point. Something like the following (I can't guarantee it will work):

    x_val, y_val, x_test, y_test = read_val(3, val_ratio=0.5)
    
    # Assuming x_val has shape [N, D], y_val has shape [N, 1]
    x_val_ds = tf.data.Dataset.from_tensor_slices(x_val)
    y_val_ds = tf.data.Dataset.from_tensor_slices(y_val)
    val_ds = tf.data.Dataset.zip((x_val_ds, y_val_ds))
    val_ds = val_ds.batch(100)  # Validate on 100 examples at a time
    
    ...
    
    # Then pass `val_ds` when training the model: 
    model.fit(..., validation_data=val_ds)
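
    For the first suggestion, here is a minimal sketch of the training loop with explicit deletion (assuming read_small, model and epochs from the question, and the val_ds built above; gc.collect() is optional but makes the freeing more predictable):

    import gc

    for epoch in range(epochs):
        print("\n" + str(epoch))
        for n in range(3):
            x_t, y_t = read_small(n)
            history = model.fit(x_t, y_t, epochs=1, batch_size=32,
                                verbose=0, validation_data=val_ds)
            # Drop the references so the sub-dataset arrays can actually be freed
            del x_t, y_t
            gc.collect()

    Note that tf.data.Dataset.from_tensor_slices((x_val, y_val)) builds the same zipped validation dataset in a single call, if you prefer.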