python, tensorflow, memory, gpu, training-data

Tensorflow: How to fix ResourceExhaustedError?


I'm trying to recreate these: Hugging Face: Question Answering Task and Hugging Face: Question Answering NLP Course.

I'm getting a ResourceExhaustedError at the model.fit() step.

---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
Cell In[14], line 1
----> 1 model.fit(x=tf_train_set, batch_size=16, validation_data=tf_validation_set, epochs=3, callbacks=[callback])
ResourceExhaustedError: Graph execution error:

Detected at node 'tf_distil_bert_for_question_answering/distilbert/transformer/layer_._4/attention/dropout_14/dropout/random_uniform/RandomUniform' defined at (most recent call last):

*A bunch of files listed here*

Node: 'tf_distil_bert_for_question_answering/distilbert/transformer/layer_._4/attention/dropout_14/dropout/random_uniform/RandomUniform'
OOM when allocating tensor with shape[16,12,384,384] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node tf_distil_bert_for_question_answering/distilbert/transformer/layer_._4/attention/dropout_14/dropout/random_uniform/RandomUniform}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
 [Op:__inference_train_function_9297]

I already tried lowering the batch_size:

    model.fit(x=tf_train_set, batch_size=16, validation_data=tf_validation_set, epochs=3, callbacks=[callback])

I also tried limiting the GPU's memory growth, following Limiting GPU memory growth.
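For reference, this is the standard way to enable memory growth in TensorFlow, so the runtime allocates GPU memory on demand instead of reserving it all at startup (a sketch of what I tried; it must run before any GPU is initialized):

```python
import tensorflow as tf

# Request on-demand GPU memory allocation instead of pre-allocating
# the whole GPU; must be called before the GPUs are first used.
gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
```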

Here are colab notebooks: Colab: Question Answering Task and Colab: Question Answering NLP Course


Solution

  • I added these lines at the beginning

    import os
    os.environ["TF_GPU_ALLOCATOR"]="cuda_malloc_async"
    

Note: with this allocator set, limiting the GPU's memory growth and lowering the batch_size weren't necessary.
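One detail worth stressing (my understanding, not from official docs on this exact error): the environment variable only takes effect if it is set before TensorFlow is first imported, because the default BFC allocator is otherwise already initialized. So the two lines should sit at the very top of the notebook, above the first import tensorflow:

```python
import os

# Switch TensorFlow's GPU allocator to CUDA's asynchronous allocator
# (cuda_malloc_async), which tends to fragment memory less than the
# default BFC allocator. Set this BEFORE the first `import tensorflow`,
# otherwise the default allocator is already in place and this is ignored.
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"
```

After this cell, import tensorflow and build the model as usual.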