Error while using TensorFlow dqn_agent collect_policy

I'm using the TensorFlow (TF-Agents) DQN agent with a Simulink environment. When I call the agent's collect policy

agent.collect_policy.action(time_step)

I get the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__Select_device_/job:localhost/replica:0/task:0/device:CPU:0}} 'then' and 'else' must have the same size.  but received: [1] vs. [] [Op:Select] name:

Calling the standard policy works:

agent.policy.action(time_step)

I double-checked whether my TimeStep matches my TimeStepSpec, and it does. (I guess agent.policy wouldn't work either if it didn't match.)

As far as I know, both policies are called in much the same way in tf_policy.py, so I have no idea what's causing the problem. If anybody has an idea what causes the error, feel free to help :)

Here's a code snippet of my agent, etc. I hope this helps.

the specification:

discount = 0.95
reward = 0.0
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

time_step_spec = TimeStep(step_type = tensor_spec.BoundedTensorSpec(shape=(1,), dtype=tf.int32, minimum=0, maximum=2),
                          reward = tensor_spec.TensorSpec(shape=(1,), dtype=tf.float32),
                          discount = tensor_spec.TensorSpec(shape=(1,), dtype=tf.float32), #fix
                          observation =  tensor_spec.TensorSpec(shape=(1,amountMachines), dtype=tf.int32)
                          )

num_possible_actions = 729
action_spec = tensor_spec.BoundedTensorSpec(
    shape=(), dtype=tf.int32, minimum=0, maximum=num_possible_actions - 1)

agent = dqn_agent.DqnAgent(
    time_step_spec,
    action_spec,
    q_network=model,
    optimizer=optimizer,
    epsilon_greedy= 1.0,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter)
agent.initialize()

the call:

current_state = get_states() #gets a np.array looking like this [4,4,4,4,4,6]
current_state_batch = tf.expand_dims(tf.convert_to_tensor(current_state, dtype=tf.int32), axis=0)

time_step = TimeStep(step_type=tf.convert_to_tensor([step_type], dtype=tf.int32),
                     reward=tf.convert_to_tensor([reward], dtype=tf.float32),
                     discount=tf.convert_to_tensor([discount], dtype=tf.float32),
                     observation=current_state_batch)

action_step = agent.collect_policy.action(time_step)

This is the full traceback:

Traceback (most recent call last):
  File "C:\Users\STestUser\AppData\Local\anaconda3\Lib\runpy.py", line 198, in _run_module_as_main
    return _run_code(code, main_globals, None,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\STestUser\AppData\Local\anaconda3\Lib\runpy.py", line 88, in _run_code
    exec(code, run_globals)
  File "c:\Users\STestUser\.vscode\extensions\ms-python.python-2023.20.0\pythonFiles\lib\python\debugpy\adapter/../..\debugpy\launcher/../..\debugpy\__main__.py", line 39, in <module>
    cli.main()
  File "c:\Users\STestUser\.vscode\extensions\ms-python.python-2023.20.0\pythonFiles\lib\python\debugpy\adapter/../..\debugpy\launcher/../..\debugpy/..\debugpy\server\cli.py", line 430, in main
    run()
  File "c:\Users\STestUser\.vscode\extensions\ms-python.python-2023.20.0\pythonFiles\lib\python\debugpy\adapter/../..\debugpy\launcher/../..\debugpy/..\debugpy\server\cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "c:\Users\STestUser\.vscode\extensions\ms-python.python-2023.20.0\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_runpy.py", line 321, in run_path
    return _run_module_code(code, init_globals, run_name,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\STestUser\.vscode\extensions\ms-python.python-2023.20.0\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "c:\Users\STestUser\.vscode\extensions\ms-python.python-2023.20.0\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "d:\Hochschule\Master\Masterarbeit\energy-efficiency-optimation\RL-Modell\OP10_QLearning.py", line 449, in <module>
    action_step = agent.collect_policy.action(time_step = time_step_t)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\STestUser\AppData\Local\anaconda3\Lib\site-packages\tf_agents\policies\tf_policy.py", line 333, in action
    step = action_fn(time_step=time_step, policy_state=policy_state, seed=seed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\STestUser\AppData\Local\anaconda3\Lib\site-packages\tf_agents\utils\common.py", line 193, in with_check_resource_vars
    return fn(*fn_args, **fn_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\STestUser\AppData\Local\anaconda3\Lib\site-packages\tf_agents\policies\epsilon_greedy_policy.py", line 141, in _action
    action = tf.nest.map_structure(
             ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\STestUser\AppData\Local\anaconda3\Lib\site-packages\tensorflow\python\util\nest.py", line 629, in map_structure
    return nest_util.map_structure(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\STestUser\AppData\Local\anaconda3\Lib\site-packages\tensorflow\python\util\nest_util.py", line 1168, in map_structure
    return _tf_core_map_structure(func, *structure, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\STestUser\AppData\Local\anaconda3\Lib\site-packages\tensorflow\python\util\nest_util.py", line 1208, in _tf_core_map_structure
    [func(*x) for x in entries],
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\STestUser\AppData\Local\anaconda3\Lib\site-packages\tensorflow\python\util\nest_util.py", line 1208, in <listcomp>
    [func(*x) for x in entries],
     ^^^^^^^^
  File "C:\Users\STestUser\AppData\Local\anaconda3\Lib\site-packages\tf_agents\policies\epsilon_greedy_policy.py", line 142, in <lambda>
    lambda g, r: tf.compat.v1.where(cond, g, r),
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\STestUser\AppData\Local\anaconda3\Lib\site-packages\tensorflow\python\util\traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\STestUser\AppData\Local\anaconda3\Lib\site-packages\tensorflow\python\framework\ops.py", line 5888, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__Select_device_/job:localhost/replica:0/task:0/device:CPU:0}} 'then' and 'else' must have the same size.  but received: [1] vs. [] [Op:Select] name:

Solution

UPDATE: I found the error on my own. The problem was the batch size.

I'm currently working with batch_size = 1, so I have to pass the values in time_step with a leading batch dimension, like this:

reward=tf.convert_to_tensor([reward], dtype=tf.float32)

BUT in the time_step_spec, the same field has to be defined without the batch dimension:

reward = tensor_spec.TensorSpec(shape=(), dtype=tf.float32)

So the spec uses shape=() while the tensor in time_step has shape (1,): the leading dimension is the batch size, and the spec only describes the shape of a single, unbatched element.
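To make the batch-versus-spec relationship concrete, here is a minimal sketch in plain Python, with no TensorFlow needed. The helper outer_dims is hypothetical and written only for illustration; it is not part of TF-Agents and only mimics the idea that whatever dimensions of a tensor are not covered by the spec are treated as batch dimensions.

```python
from typing import Sequence, Tuple


def outer_dims(tensor_shape: Sequence[int], spec_shape: Sequence[int]) -> Tuple[int, ...]:
    """Return the leading (batch) dims left after matching the spec's dims.

    Hypothetical helper for illustration only; not a TF-Agents function.
    """
    tensor_shape = tuple(tensor_shape)
    spec_shape = tuple(spec_shape)
    if len(spec_shape) > len(tensor_shape):
        raise ValueError(
            f"tensor shape {tensor_shape} has lower rank than spec shape {spec_shape}")
    # The spec describes the trailing dims; everything before that is "outer".
    split = len(tensor_shape) - len(spec_shape)
    if tensor_shape[split:] != spec_shape:
        raise ValueError(
            f"trailing dims of {tensor_shape} do not match spec shape {spec_shape}")
    return tensor_shape[:split]


# Spec from the question: shape=(1,) already consumes the only dimension,
# so nothing is left over to serve as the batch dimension.
no_batch = outer_dims(tensor_shape=(1,), spec_shape=(1,))

# Corrected spec: shape=() leaves the leading dimension free as the batch.
with_batch = outer_dims(tensor_shape=(1,), spec_shape=())

print(no_batch)    # ()
print(with_batch)  # (1,)
```

With the original spec the tensor looks unbatched, and with the corrected spec it looks like a batch of one, which would be consistent with the "[1] vs. []" mismatch in the error message. By the same reasoning, an observation tensor of shape (1, amountMachines) would pair with a spec of shape (amountMachines,), though I have only verified the reward field above.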