I am trying to follow TensorFlow's REINFORCE agent tutorial. It works with the tutorial's code, but when I substitute my own environment I get this error:
Received incompatible tensor at flattened index 0 from table 'uniform_table'. Specification has (dtype, shape): (int32, [?]). Tensor has (dtype, shape): (int32, [92,1]).
Table signature: 0: Tensor<name: 'step_type/step_type', dtype: int32, shape: [?]>, 1: Tensor<name: 'observation/observation', dtype: double, shape: [?,18]>, 2: Tensor<name: 'action/action', dtype: float, shape: [?,2]>, 3: Tensor<name: 'next_step_type/step_type', dtype: int32, shape: [?]>, 4: Tensor<name: 'reward/reward', dtype: float, shape: [?]>, 5: Tensor<name: 'discount/discount', dtype: float, shape: [?]> [Op:IteratorGetNext]
This is interesting because 92 is exactly the number of steps in the episode.
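To make the mismatch concrete, here is a minimal sketch of the kind of shape check the error message describes. An unknown dimension (`?` in the error, `None` in the specs below) matches any size, but the number of dimensions still has to agree. The function name `is_compatible` is hypothetical and this is not Reverb's actual implementation, only an illustration of the rule.

```python
def is_compatible(spec_shape, tensor_shape):
    """Return True if tensor_shape fits spec_shape; None marks an unknown dimension."""
    if len(spec_shape) != len(tensor_shape):
        return False  # rank mismatch: no wildcard can absorb an extra axis
    return all(s is None or s == t for s, t in zip(spec_shape, tensor_shape))


# step_type from the error: spec [?] against a stored tensor of shape [92, 1]
print(is_compatible((None,), (92, 1)))       # False
# the same entry without the extra axis
print(is_compatible((None,), (92,)))         # True
# observation: spec [?, 18] against a tensor of shape [92, 18]
print(is_compatible((None, 18), (92, 18)))   # True
```

Under this rule a spec of `[?]` can never accept `[92, 1]`, whatever the episode length is.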
The table signature when using my environment is:
Trajectory(
{'action': BoundedTensorSpec(shape=(None, 2), dtype=tf.float32, name='action', minimum=array(0., dtype=float32), maximum=array(3.4028235e+38, dtype=float32)),
'discount': BoundedTensorSpec(shape=(None,), dtype=tf.float32, name='discount', minimum=array(0., dtype=float32), maximum=array(1., dtype=float32)),
'next_step_type': TensorSpec(shape=(None,), dtype=tf.int32, name='step_type'),
'observation': BoundedTensorSpec(shape=(None, 18), dtype=tf.float64, name='observation', minimum=array([0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
7.5189e+02, 6.1000e-01, 1.0860e+01, 1.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00]), maximum=array(1.79769313e+308)),
'policy_info': (),
'reward': TensorSpec(shape=(None,), dtype=tf.float32, name='reward'),
'step_type': TensorSpec(shape=(None,), dtype=tf.int32, name='step_type')})
And when using the working tutorial environment:
Trajectory(
{'action': BoundedTensorSpec(shape=(None,), dtype=tf.int64, name='action', minimum=array(0), maximum=array(1)),
'discount': BoundedTensorSpec(shape=(None,), dtype=tf.float32, name='discount', minimum=array(0., dtype=float32), maximum=array(1., dtype=float32)),
'next_step_type': TensorSpec(shape=(None,), dtype=tf.int32, name='step_type'),
'observation': BoundedTensorSpec(shape=(None, 4), dtype=tf.float32, name='observation', minimum=array([-4.8000002e+00, -3.4028235e+38, -4.1887903e-01, -3.4028235e+38],
dtype=float32), maximum=array([4.8000002e+00, 3.4028235e+38, 4.1887903e-01, 3.4028235e+38],
dtype=float32)),
'policy_info': (),
'reward': TensorSpec(shape=(None,), dtype=tf.float32, name='reward'),
'step_type': TensorSpec(shape=(None,), dtype=tf.int32, name='step_type')})
The only dimensional differences are that my agent's action consists of two scalars where the tutorial's consists of one, and that my observation is longer (18 values instead of 4). In both cases the unknown dimension comes first, ahead of any known dimension.
The trajectories used as input for the replay buffer also match up. I printed their dimensions as they were created, first for my version:
[(92, 1), (92, 1, 18), (92, 1, 2), (92, 1), (92, 1), (92, 1)]
[(92, 1), (92, 1, 18), (92, 1, 2), (92, 1), (92, 1), (92, 1)]
[(92, 1), (92, 1, 18), (92, 1, 2), (92, 1), (92, 1), (92, 1)]
[(92, 1), (92, 1, 18), (92, 1, 2), (92, 1), (92, 1), (92, 1)]
and then for the tutorial version:
[(9, 1), (9, 1, 4), (9, 1), (9, 1), (9, 1), (9, 1)]
[(11, 1), (11, 1, 4), (11, 1), (11, 1), (11, 1), (11, 1)]
[(10, 1), (10, 1, 4), (10, 1), (10, 1), (10, 1), (10, 1)]
[(10, 1), (10, 1, 4), (10, 1), (10, 1), (10, 1), (10, 1)]
[(10, 1), (10, 1, 4), (10, 1), (10, 1), (10, 1), (10, 1)]
[(10, 1), (10, 1, 4), (10, 1), (10, 1), (10, 1), (10, 1)]
[(9, 1), (9, 1, 4), (9, 1), (9, 1), (9, 1), (9, 1)]
[(9, 1), (9, 1, 4), (9, 1), (9, 1), (9, 1), (9, 1)]
[(9, 1), (9, 1, 4), (9, 1), (9, 1), (9, 1), (9, 1)]
[(9, 1), (9, 1, 4), (9, 1), (9, 1), (9, 1), (9, 1)]
So in both versions, each entry in the trajectory has the shape (number of steps, batch size, value length), where the last dimension is present only if the entry itself is a list.
I get the error mentioned at the start of the question when running the second of these two lines of code:
iterator = iter(replay_buffer.as_dataset(sample_batch_size=1))
trajectories, _ = next(iterator)
However, these lines of code run successfully using the tutorial's code, and 'trajectories' is as follows:
Trajectory(
{'action': <tf.Tensor: shape=(1, 50), dtype=int64, numpy=
array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0,
1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,
1, 0, 0, 0, 1, 1]])>,
'discount': <tf.Tensor: shape=(1, 50), dtype=float32, numpy=
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
0., 1.]], dtype=float32)>,
'next_step_type': <tf.Tensor: shape=(1, 50), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 2, 0]], dtype=int32)>,
'observation': <tf.Tensor: shape=(1, 50, 4), dtype=float32, numpy=
array([[[ 0.02992676, 0.01392324, 0.03861422, -0.04107672],
[ 0.03020522, -0.18173054, 0.03779269, 0.26353496],
[ 0.02657061, -0.37737098, 0.04306339, 0.56789446],
[ 0.01902319, -0.18287869, 0.05442128, 0.2890832 ],
[ 0.01536562, 0.01142669, 0.06020294, 0.0140486 ],
[ 0.01559415, 0.20563589, 0.06048391, -0.25904846],
[ 0.01970687, 0.39984456, 0.05530294, -0.53205734],
[ 0.02770376, 0.59414685, 0.04466179, -0.80681443],
[ 0.0395867 , 0.39844212, 0.02852551, -0.50042385],
[ 0.04755554, 0.2029299 , 0.01851703, -0.19888948],
[ 0.05161414, 0.39778218, 0.01453924, -0.48567408],
[ 0.05956978, 0.59269595, 0.00482576, -0.7737395 ],
[ 0.0714237 , 0.39750797, -0.01064903, -0.47954214],
[ 0.07937386, 0.5927786 , -0.02023987, -0.7755622 ],
[ 0.09122943, 0.3979408 , -0.03575112, -0.48931554],
[ 0.09918825, 0.20334099, -0.04553743, -0.20811091],
[ 0.10325507, 0.39908352, -0.04969965, -0.5148037 ],
[ 0.11123674, 0.59486884, -0.05999572, -0.82272476],
[ 0.12313411, 0.40061677, -0.07645022, -0.54949903],
[ 0.13114645, 0.20664726, -0.0874402 , -0.2818491 ],
[ 0.1352794 , 0.01287431, -0.09307718, -0.0179748 ],
[ 0.13553688, -0.18079808, -0.09343667, 0.24395113],
[ 0.13192092, -0.37446988, -0.08855765, 0.50576115],
[ 0.12443152, -0.17821889, -0.07844243, 0.18653633],
[ 0.12086715, 0.01793264, -0.0747117 , -0.12982464],
[ 0.1212258 , -0.17604397, -0.0773082 , 0.13838378],
[ 0.11770492, 0.02009523, -0.07454053, -0.17765227],
[ 0.11810682, -0.17388523, -0.07809357, 0.09061581],
[ 0.11462912, 0.02226418, -0.07628125, -0.22564775],
[ 0.1150744 , -0.17168939, -0.08079421, 0.04203164],
[ 0.11164062, 0.02449259, -0.07995357, -0.27500907],
[ 0.11213046, -0.16940299, -0.08545376, -0.00857614],
[ 0.10874241, -0.36320207, -0.08562528, 0.2559689 ],
[ 0.10147836, -0.5570038 , -0.0805059 , 0.52046335],
[ 0.09033829, -0.3608463 , -0.07009663, 0.20353697],
[ 0.08312136, -0.55489945, -0.06602589, 0.47331032],
[ 0.07202338, -0.7490298 , -0.05655969, 0.7444739 ],
[ 0.05704278, -0.5531748 , -0.04167021, 0.43454146],
[ 0.04597928, -0.35748845, -0.03297938, 0.12901925],
[ 0.03882951, -0.16190998, -0.03039899, -0.17388314],
[ 0.03559131, 0.03363356, -0.03387666, -0.47599885],
[ 0.03626398, 0.22921707, -0.04339663, -0.77916366],
[ 0.04084833, 0.42490798, -0.05897991, -1.0851783 ],
[ 0.04934648, 0.6207563 , -0.08068347, -1.39577 ],
[ 0.06176161, 0.4267255 , -0.10859887, -1.1293658 ],
[ 0.07029612, 0.623089 , -0.13118619, -1.4540412 ],
[ 0.0827579 , 0.42979917, -0.16026701, -1.205056 ],
[ 0.09135389, 0.23706956, -0.18436813, -0.96658343],
[ 0.09609528, 0.04483784, -0.2036998 , -0.7370203 ],
[ 0.09699203, 0.24210311, -0.2184402 , -1.0862749 ]]],
dtype=float32)>,
'policy_info': (),
'reward': <tf.Tensor: shape=(1, 50), dtype=float32, numpy=
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 0.]], dtype=float32)>,
'step_type': <tf.Tensor: shape=(1, 50), dtype=int32, numpy=
array([[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 2]], dtype=int32)>})
So when everything is working correctly, feeding the replay buffer trajectories whose entries have shape (number of steps, batch size, value length) produces a dataset in which each entry in each row has shape (batch size, number of steps, value length). In both cases the last dimension is present only if the entry itself is a list.
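The shape relationship described above can be sketched with plain nested lists. This only illustrates the swap of the first two axes; it makes no claim about how the replay buffer produces it internally. The helper names `swap_time_and_batch` and `shape` are hypothetical, and the values are placeholders.

```python
def swap_time_and_batch(entry):
    """Turn a [steps][batch][...] nested list into [batch][steps][...]."""
    return [list(per_batch) for per_batch in zip(*entry)]


def shape(nested):
    """Report the shape of a regular, non-empty nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# 9 steps, batch size 1, observation length 4 (placeholder values)
observation = [[[0.0] * 4] for _ in range(9)]
reward = [[1.0] for _ in range(9)]

print(shape(observation), "->", shape(swap_time_and_batch(observation)))  # (9, 1, 4) -> (1, 9, 4)
print(shape(reward), "->", shape(swap_time_and_batch(reward)))            # (9, 1) -> (1, 9)
```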
However, in my version, each entry in each row of the dataset keeps its original shape, which causes the error. Does anyone experienced with Reverb know why this might be happening?
I did a lot more digging into the TensorFlow backend. The problem is that the cartpole gym wrapper creates a non-batched Python environment, while the default is a batched environment. So when I run my code, an additional (batch) dimension is added to the trajectories before they are stored in the Reverb table. Since I am using the same table signature, pulling an entry out of the table raises an exception, because that signature conflicts with the actual shape of the stored entries.
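As a rough illustration of that conflict, the sketch below uses nested lists to show how removing a size-1 batch axis would bring the stored shapes back in line with the signature shapes from the error ([?] for step_type, [?,18] for observation). The helper names `drop_batch_axis` and `shape` are hypothetical, the values are placeholders, and whether the right fix is to unbatch the data or to change the table signature is left open here.

```python
def drop_batch_axis(entry):
    """Turn a [steps][1][...] nested list into [steps][...]."""
    for step in entry:
        if len(step) != 1:
            raise ValueError("expected a batch axis of size 1")
    return [step[0] for step in entry]


def shape(nested):
    """Report the shape of a regular, non-empty nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Shapes as printed for my environment: 92 steps, batch size 1
step_type = [[1] for _ in range(92)]
observation = [[[0.0] * 18] for _ in range(92)]

print(shape(step_type), "->", shape(drop_batch_axis(step_type)))      # (92, 1) -> (92,)
print(shape(observation), "->", shape(drop_batch_axis(observation)))  # (92, 1, 18) -> (92, 18)
```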