I'm using RLlib's PPOTrainer with a custom environment. I execute trainer.train()
two times; the first call completes successfully, but the second one crashes with this error:
(pid=15248)   File "lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
(pid=15248)     raise type(e)(node_def, op, message)
(pid=15248) tensorflow.python.framework.errors_impl.InvalidArgumentError:
    Received a label value of 5 which is outside the valid range of [0, 5). Label values: 5 5
(pid=15248)   [[node default_policy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits (defined at /tensorflow_core/python/framework/ops.py:1751) ]]
Here's my code:
main.py
import ray
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.models import ModelCatalog

ModelCatalog.register_custom_preprocessor("tree_obs_prep", TreeObsPreprocessor)
ray.init()

trainer = PPOTrainer(env=MyEnv, config={
    "train_batch_size": 4000,
    "model": {
        "custom_preprocessor": "tree_obs_prep"
    }
})

for i in range(2):
    print(trainer.train())
MyEnv.py
import gym
import numpy as np
from ray import rllib


class MyEnv(rllib.env.MultiAgentEnv):
    def __init__(self, env_config):
        self.n_agents = 2
        self.env = *CREATES ENV*
        self.action_space = gym.spaces.Discrete(5)
        self.observation_space = np.zeros((1, 12))

    def reset(self):
        self.agents_done = []
        obs = self.env.reset()
        return obs[0]

    def step(self, action_dict):
        obs, rewards, dones, infos = self.env.step(action_dict)
        d = dict()
        r = dict()
        o = dict()
        i = dict()
        for i_agent in range(len(self.env.agents)):
            if i_agent not in self.agents_done:
                o[i_agent] = obs[i_agent]
                r[i_agent] = rewards[i_agent]
                d[i_agent] = dones[i_agent]
                i[i_agent] = infos[i_agent]
        d['__all__'] = dones['__all__']
        for agent, done in dones.items():
            if done and agent != '__all__':
                self.agents_done.append(agent)
        return o, r, d, i
I have no idea what the problem is. Any suggestions? What does this error mean?
This comment really helped me:
FWIW, I think such issues can happen if NaNs appear in the policy output. When that happens, you can get out of range errors.
Usually it's due to the observation or reward somehow becoming NaN, though it could be the policy diverging as well.
In my case, I had to modify my observations: the agent wasn't able to learn a policy, and at some point during training (at a random timestep) the returned action was NaN.
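For reference, here is a minimal sketch of the kind of observation guard I mean. It assumes the underlying env returns per-agent observations as NumPy arrays and rewards as floats; the sanitize_obs helper and its clipping bounds are hypothetical (not part of RLlib) and would need to be adapted to your own observation layout.

import numpy as np

def sanitize_obs(obs, low=-1e3, high=1e3):
    # Hypothetical helper: replace NaN/inf values and clip the
    # observation to a finite range so the policy never sees
    # non-finite inputs. Bounds are placeholders to tune per env.
    obs = np.asarray(obs, dtype=np.float32)
    obs = np.nan_to_num(obs)          # NaN -> 0, +/-inf -> large finite values
    return np.clip(obs, low, high)

# Inside MyEnv.step(), before filling the per-agent dicts:
#     o[i_agent] = sanitize_obs(obs[i_agent])
#     r[i_agent] = float(np.nan_to_num(rewards[i_agent]))
# Alternatively, an assertion catches the problem early instead of masking it:
#     assert np.all(np.isfinite(o[i_agent])), "non-finite obs for agent %d" % i_agent

If the observations and rewards are kept finite, the policy logits should stay finite as well, so the sampled action should remain inside Discrete(5) and the sparse-softmax loss no longer receives an out-of-range label like the 5 in the traceback above.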