tensorflow, reinforcement-learning, openai-gym, dqn, tf-agent

TF-Agents Actor/Learner: TFUniformReplayBuffer dimensionality issue - invalid shape of replay buffer vs. actor update


I am trying to adapt this tf-agents actor<->learner DQN Atari Pong example to my Windows machine using a TFUniformReplayBuffer instead of the ReverbReplayBuffer, which only works on Linux machines, but I am facing a dimensionality issue.

    [...]
    ---> 67 init_buffer_actor.run()
    [...]
    InvalidArgumentError: {{function_node __wrapped__ResourceScatterUpdate_device_/job:localhost/replica:0/task:0/device:CPU:0}} Must have updates.shape = indices.shape + params.shape[1:] or updates.shape = [], got updates.shape [84,84,4], indices.shape [1], params.shape [1000,84,84,4] [Op:ResourceScatterUpdate]

The problem is as follows: the TF actor tries to access the replay buffer and initialize it with a certain number of random samples of shape (84,84,4), as described in this DeepMind paper, but the replay buffer requires samples of shape (1,84,84,4).
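
To make the mismatch concrete, here is a minimal NumPy sketch (illustration only, not part of my code; the shapes are taken from the error message above):

    import numpy as np

    # what the actor's observer currently hands over: a single, unbatched item
    unbatched_obs = np.zeros((84, 84, 4), dtype=np.uint8)

    # what a TFUniformReplayBuffer created with batch_size=1 expects per
    # add_batch call: the same item with a leading batch dimension
    batched_obs = np.expand_dims(unbatched_obs, axis=0)

    print(unbatched_obs.shape, batched_obs.shape)  # (84, 84, 4) (1, 84, 84, 4)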

My code is as follows:

    import tensorflow as tf

    from tf_agents.agents.dqn import dqn_agent
    from tf_agents.environments import suite_atari
    from tf_agents.policies import random_py_policy
    from tf_agents.replay_buffers import tf_uniform_replay_buffer
    from tf_agents.train import actor
    from tf_agents.train.utils import spec_utils, train_utils

    def train_pong(
        env_name='ALE/Pong-v5',
        initial_collect_steps=50000,
        max_episode_frames_collect=50000,
        batch_size=32,
        learning_rate=0.00025,
        replay_capacity=1000):
  
        # load atari environment
        collect_env = suite_atari.load(
            env_name,
            max_episode_steps=max_episode_frames_collect,
            gym_env_wrappers=suite_atari.DEFAULT_ATARI_GYM_WRAPPERS_WITH_STACKING)

        # create tensor specs
        observation_tensor_spec, action_tensor_spec, time_step_tensor_spec = (
            spec_utils.get_tensor_specs(collect_env))
  
        # create training util
        train_step = train_utils.create_train_step()
  
        # calculate no. of actions
        num_actions = action_tensor_spec.maximum - action_tensor_spec.minimum + 1
  
        # create agent
        agent = dqn_agent.DqnAgent(
            time_step_tensor_spec,
            action_tensor_spec,
            q_network=create_DL_q_network(num_actions),
            optimizer=tf.compat.v1.train.RMSPropOptimizer(learning_rate=learning_rate))
    
        # create uniform replay buffer
        replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
            data_spec=agent.collect_data_spec,
            batch_size=1,
            max_length=replay_capacity)

        # observer of replay buffer
        rb_observer = replay_buffer.add_batch
        
        # create batch dataset
        dataset = replay_buffer.as_dataset(
            sample_batch_size=batch_size,
            num_steps = 2,
            single_deterministic_pass=False).prefetch(3)
    
        # create callable function for actor
        experience_dataset_fn = lambda: dataset
  
        # create random policy for buffer init
        random_policy = random_py_policy.RandomPyPolicy(collect_env.time_step_spec(),
                                                  collect_env.action_spec())
  
        # create initalizer
        init_buffer_actor = actor.Actor(
            collect_env,
            random_policy,
            train_step,
            steps_per_run=initial_collect_steps,
            observers=[replay_buffer.add_batch])

        # initialize buffer with random samples
        init_buffer_actor.run()

(The approach uses the OpenAI Gym environment as well as the corresponding wrapper functions.)

I have worked with keras-rl2 and with tf-agents without the actor<->learner setup for other Atari games to create DQNs, and both worked quite well after some adaptations. I guess my current code would also work after a few adaptations to the tf-agents library functions, but that would defeat the purpose of the library.

My current assumption: the actor<->learner methods are not able to work with the TFUniformReplayBuffer (the way I expect them to) due to missing support for a TFPyEnvironment - or I still have some gaps in my understanding of this tf-agents approach.

Previous (successful) attempt:

    from tf_agents.drivers.dynamic_step_driver import DynamicStepDriver
    from tf_agents.environments.tf_py_environment import TFPyEnvironment

    tf_collect_env = TFPyEnvironment(collect_env)
    init_driver = DynamicStepDriver(
        tf_collect_env,
        random_policy,
        observers=[replay_buffer.add_batch],
        num_steps=200)
    init_driver.run()

I would be very grateful if someone could explain to me what I am overlooking here.


Solution

  • The full fix is shown below.

    --> The dimensionality error is valid and indicates that the (uploaded) samples are not in the correct batched shape.

    --> This issue happens because the "add_batch" method is fed values with the wrong (unbatched) shape:

        rb_observer = replay_buffer.add_batch
    

    Long story short, this line should be replaced by (batch_nested_array is imported from tf_agents.utils.nest_utils):

      rb_observer = lambda x: replay_buffer.add_batch(batch_nested_array(x))
    

    --> Afterwards the replay buffer inputs have the correct shape and the actor/learner setup starts training.

    The full replay buffer setup is shown below:

      from tf_agents.utils.nest_utils import batch_nested_array

      # create buffer for storing experience
      replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
          agent.collect_data_spec,
          batch_size=1,
          max_length=1000000)

      # create batch dataset
      dataset = replay_buffer.as_dataset(
          sample_batch_size=32,
          num_steps=2,
          single_deterministic_pass=False).prefetch(4)

      # create batched nested array input for rb_observer
      rb_observer = lambda x: replay_buffer.add_batch(batch_nested_array(x))

      # create batched readout of dataset
      experience_dataset_fn = lambda: dataset
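
    For context, here is a minimal sketch of how this observer and dataset function could be plugged into the actor/learner loop. This is only an illustration of my setup, not part of the fix itself: agent, collect_env, train_step, replay_buffer and experience_dataset_fn are assumed to be defined as above, and tempdir is a hypothetical root directory for the Learner's checkpoints and summaries.

      from tf_agents.policies import py_tf_eager_policy
      from tf_agents.train import actor, learner
      from tf_agents.utils.nest_utils import batch_nested_array

      # observer that batches each trajectory before writing it to the buffer
      rb_observer = lambda x: replay_buffer.add_batch(batch_nested_array(x))

      # collect actor: wraps the TF collect policy so it can run in the py environment
      collect_actor = actor.Actor(
          collect_env,
          py_tf_eager_policy.PyTFEagerPolicy(
              agent.collect_policy, use_tf_function=True),
          train_step,
          steps_per_run=1,
          observers=[rb_observer])

      # learner that trains the agent from the batched dataset readout
      dqn_learner = learner.Learner(
          tempdir,  # hypothetical directory for checkpoints/summaries
          train_step,
          agent,
          experience_dataset_fn=experience_dataset_fn)

      # one collect/train iteration
      collect_actor.run()
      dqn_learner.run(iterations=1)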