I'm trying to train a RetinaNet model by following the official TensorFlow Model Garden (tf-models-official) object detection tutorial. However, training fails with the following error:
restoring or initializing model...
train | step: 0 | training until step 100...
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
<ipython-input-19-d18d13cd1da2> in <cell line: 0>()
----> 1 model, eval_logs = tfm.core.train_lib.run_experiment(
2 distribution_strategy=distribution_strategy,
3 task=task,
4 mode='train_and_eval',
5 params=exp_config,
7 frames
/usr/local/lib/python3.11/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
57 e.message += " name: " + name
58 raise core._status_to_exception(e) from None
---> 59 except TypeError as e:
60 keras_symbolic_tensors = [x for x in inputs if _is_keras_symbolic_tensor(x)]
61 if keras_symbolic_tensors:
InvalidArgumentError: Graph execution error:
Detected at node retina_net_model/res_net/conv2d/Conv2D defined at (most recent call last):
File "/usr/lib/python3.11/threading.py", line 1002, in _bootstrap
File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
File "/tmp/__autograph_generated_file_fu7t7gj.py", line 30, in step_fn
File "/usr/local/lib/python3.11/dist-packages/official/vision/tasks/retinanet.py", line 327, in train_step
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 65, in error_handler
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/engine/training.py", line 588, in __call__
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 65, in error_handler
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/engine/base_layer.py", line 1142, in __call__
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 96, in error_handler
File "/usr/local/lib/python3.11/dist-packages/official/vision/modeling/retinanet_model.py", line 129, in call
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 65, in error_handler
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/engine/training.py", line 588, in __call__
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 65, in error_handler
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/engine/base_layer.py", line 1142, in __call__
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 96, in error_handler
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/engine/functional.py", line 514, in call
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/engine/functional.py", line 671, in _run_internal_graph
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 65, in error_handler
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/engine/base_layer.py", line 1142, in __call__
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 96, in error_handler
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/layers/convolutional/base_conv.py", line 289, in call
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/layers/convolutional/base_conv.py", line 261, in convolution_op
Detected at node retina_net_model/res_net/conv2d/Conv2D defined at (most recent call last):
File "/usr/lib/python3.11/threading.py", line 1002, in _bootstrap
File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
File "/tmp/__autograph_generated_file_fu7t7gj.py", line 30, in step_fn
File "/usr/local/lib/python3.11/dist-packages/official/vision/tasks/retinanet.py", line 327, in train_step
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 65, in error_handler
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/engine/training.py", line 588, in __call__
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 65, in error_handler
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/engine/base_layer.py", line 1142, in __call__
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 96, in error_handler
File "/usr/local/lib/python3.11/dist-packages/official/vision/modeling/retinanet_model.py", line 129, in call
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 65, in error_handler
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/engine/training.py", line 588, in __call__
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 65, in error_handler
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/engine/base_layer.py", line 1142, in __call__
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 96, in error_handler
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/engine/functional.py", line 514, in call
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/engine/functional.py", line 671, in _run_internal_graph
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 65, in error_handler
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/engine/base_layer.py", line 1142, in __call__
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/utils/traceback_utils.py", line 96, in error_handler
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/layers/convolutional/base_conv.py", line 289, in call
File "/usr/local/lib/python3.11/dist-packages/tf_keras/src/layers/convolutional/base_conv.py", line 261, in convolution_op
2 root error(s) found.
(0) INVALID_ARGUMENT: No DNN in stream executor.
[[{{node retina_net_model/res_net/conv2d/Conv2D}}]]
[[while/body/_1/while/NoOp/_31]]
(1) INVALID_ARGUMENT: No DNN in stream executor.
[[{{node retina_net_model/res_net/conv2d/Conv2D}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_loop_fn_32055]
I am using the exact training notebook from [https://www.tensorflow.org/tfmodels/vision/object_detection]. I didn't change anything.
What I've tried:
In my case, the error "InvalidArgumentError: No DNN in stream executor" when training Mask R-CNN on Colab disappeared after I downgraded TensorFlow to 2.17.1 and tf-models-official to 2.17.0 (both were previously at 2.18.0). Hopefully this works for you too.
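The downgrade described above can be sketched as follows. This is only a minimal sketch, assuming a Colab-style environment where pip is available. The helper names `installed_version` and `pip_command` are hypothetical and exist only for illustration. The script prints the currently installed versions next to the pins from this answer and builds the matching pip command rather than running it.

```python
import importlib.metadata as md

# Versions reported in this answer to avoid the error (not independently verified).
PINS = {"tensorflow": "2.17.1", "tf-models-official": "2.17.0"}


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return md.version(package)
    except md.PackageNotFoundError:
        return None


def pip_command(pins):
    """Build the pip command that would install exactly the pinned versions."""
    specs = [f"{name}=={version}" for name, version in pins.items()]
    return "pip install " + " ".join(specs)


# Show what is installed now versus what the answer suggests.
for name, wanted in PINS.items():
    print(f"{name}: installed={installed_version(name)}, wanted={wanted}")

# Print the command instead of executing it, so nothing is changed by accident.
print(pip_command(PINS))
```

In a notebook you would typically run the printed command in its own cell prefixed with `!`, then restart the runtime so the newly installed versions are the ones that get imported.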