python tensorflow machine-learning google-colaboratory retinanet

InvalidArgumentError: No DNN in stream executor while training a TensorFlow RetinaNet model on Google Colab


I'm trying to train an object detection model from the TensorFlow Model Garden (the official vision models), specifically a RetinaNet. However, I encounter the following error during training:

restoring or initializing model...
train | step:      0 | training until step 100...

---------------------------------------------------------------------------

InvalidArgumentError                      Traceback (most recent call last)

<ipython-input-19-d18d13cd1da2> in <cell line: 0>()
----> 1 model, eval_logs = tfm.core.train_lib.run_experiment(
      2     distribution_strategy=distribution_strategy,
      3     task=task,
      4     mode='train_and_eval',
      5     params=exp_config,

7 frames

/usr/local/lib/python3.11/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     57       e.message += " name: " + name
---> 58     raise core._status_to_exception(e) from None
     59   except TypeError as e:
     60     keras_symbolic_tensors = [x for x in inputs if _is_keras_symbolic_tensor(x)]
     61     if keras_symbolic_tensors:

InvalidArgumentError: Graph execution error:

Detected at node retina_net_model/res_net/conv2d/Conv2D defined at (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1002, in _bootstrap

  File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner

  File "/tmp/__autograph_generated_file_fu7t7gj.py", line 30, in step_fn

  File "/usr/local/lib/python3.11/dist-packages/official/vision/tasks/retinanet.py", line 327, in train_step

  File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 65, in error_handler

  File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/engine/training.py", line 588, in __call__

  File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 65, in error_handler

  File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/engine/base_layer.py", line 1142, in __call__

  File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 96, in error_handler

  File "/usr/local/lib/python3.11/dist-packages/official/vision/modeling/retinanet_model.py", line 129, in call

  File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 65, in error_handler

  File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/engine/training.py", line 588, in __call__

  File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 65, in error_handler

  File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/engine/base_layer.py", line 1142, in __call__

  File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 96, in error_handler

  File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/engine/functional.py", line 514, in call

  File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/engine/functional.py", line 671, in _run_internal_graph

  File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 65, in error_handler

  File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/engine/base_layer.py", line 1142, in __call__

  File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 96, in error_handler

  File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/layers/convolutional/base_conv.py", line 289, in call

  File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/layers/convolutional/base_conv.py", line 261, in convolution_op

2 root error(s) found.
  (0) INVALID_ARGUMENT:  No DNN in stream executor.
     [[{{node retina_net_model/res_net/conv2d/Conv2D}}]]
     [[while/body/_1/while/NoOp/_31]]
  (1) INVALID_ARGUMENT:  No DNN in stream executor.
     [[{{node retina_net_model/res_net/conv2d/Conv2D}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_loop_fn_32055]

I am using the exact tutorial notebook for training (https://www.tensorflow.org/tfmodels/vision/object_detection). I didn't change anything.

What I've tried:

  1. TensorFlow version compatibility - ensured I'm using a TensorFlow version that matches the Colab GPU runtime.
  2. Checking GPU availability - confirmed that a GPU is available and being used in Colab (see the snippet after this list).
  3. Reducing batch size - to rule out memory issues, but the error persists.
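
For points 1 and 2, the check in a Colab cell looks something like this (standard TensorFlow calls, not part of the tutorial notebook):

import tensorflow as tf

# Version installed in the Colab runtime
print("TensorFlow:", tf.__version__)

# Whether this TensorFlow build was compiled with CUDA/cuDNN support
print("Built with CUDA:", tf.test.is_built_with_cuda())

# GPUs visible to TensorFlow; an empty list means the runtime is effectively CPU-only
print("GPUs:", tf.config.list_physical_devices('GPU'))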

Solution

  • In my case, the "InvalidArgumentError: No DNN in stream executor" error while training Mask R-CNN on Colab disappeared after I downgraded TensorFlow to 2.17.1 and tf-models-official to 2.17.0 (previously 2.18.0 for both), as shown below. Hopefully, it will work for you too.
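
In a Colab cell, the downgrade looks something like this (restart the runtime afterwards so the pinned versions are actually loaded):

# Pin the versions mentioned above, then Runtime -> Restart session
!pip install -q tensorflow==2.17.1 tf-models-official==2.17.0

pip may warn about conflicts with preinstalled Colab packages; what matters is that tensorflow and tf-models-official end up on the pinned versions.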