I am training an RL agent to optimise dispatching in a job shop manufacturing system. My approach is based on this code: https://github.com/AndreasKuhnle/SimRLFab; I migrated the environment to a gymnasium environment and updated it from Python 3.6 to Python 3.10. I am testing different algorithms such as PPO, TRPO and DQN. During training I noticed that the mean reward per episode, the ep_rew_mean in my tensorboard, decreases over time, contrary to my expectation that it should increase. The reward function is the utilization rate of the machines and should be maximised. What could be the reason for this behaviour?
I am using a "self-made" gym environment together with a simpy simulation environment. Since I do not consider myself an expert, it looks to me as if the agent learns to minimize the reward, although it should not. Am I right about that? As far as I understand it, the utilization should be maximised, which is why the reward is positive and calculated as r_util = exp(util / 1.5) - 1.
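A quick numeric sanity check of that mapping (plain Python, utilization values made up) shows it is monotonically increasing and stays roughly within [0, 0.95] for util in [0, 1]:

import numpy as np

# Reward mapping used for the dense utilization reward: r_util = exp(util / 1.5) - 1
for util in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"util={util:.2f} -> r_util={np.exp(util / 1.5) - 1.0:.3f}")
# util=0.00 -> r_util=0.000
# util=0.25 -> r_util=0.181
# util=0.50 -> r_util=0.396
# util=0.75 -> r_util=0.649
# util=1.00 -> r_util=0.948

So higher utilization always yields a higher, non-negative reward, which matches my expectation that the agent should maximise it.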
The ep_rew_mean diagram from tensorboard: [figure: ep_rew_mean over training steps]
The losses from tensorboard suggest it learns at least something, although I am not sure whether it learns the wrong thing: [figures: loss and policy gradient loss, value loss]
The step function, which calls the reward calculation, is:
def step(self, actions):
    reward = None
    terminal = False
    states = None
    truncated = False
    info = {}
    self.step_counter += 1
    # print(self.counter, "Agent-Action: ", int(actions))
    if (self.step_counter % self.parameters['EXPORT_FREQUENCY'] == 0 or self.step_counter % self.max_episode_timesteps == 0) \
            and not self.parameters['EXPORT_NO_LOGS']:
        self.export_statistics(self.step_counter, self.count_episode)
    if self.step_counter == self.max_episode_timesteps:
        print("Last episode action ", datetime.now())
        truncated = True
    # If multiple transport agents then for loop required
    for agent in Transport.agents_waiting_for_action:
        agent = Transport.agents_waiting_for_action.pop(0)
        if self.parameters['TRANSP_AGENT_ACTION_MAPPING'] == 'direct':
            agent.next_action = [int(actions)]
        elif self.parameters['TRANSP_AGENT_ACTION_MAPPING'] == 'resource':
            agent.next_action = [int(actions[0]), int(actions[1])]
        agent.state_before = None
        self.parameters['continue_criteria'].succeed()
        self.parameters['continue_criteria'] = self.env.event()
        self.env.run(until=self.parameters['step_criteria'])  # Waiting until action is processed in simulation environment
        # Simulation is now in state after action processing
        reward, terminal = agent.calculate_reward(actions)
        if terminal:
            print("Last episode action ", datetime.now())
            self.export_statistics(self.step_counter, self.count_episode)
    agent = Transport.agents_waiting_for_action[0]
    states = agent.calculate_state()  # Calculate state for next action determination
    if self.parameters['TRANSP_AGENT_ACTION_MAPPING'] == 'direct':
        self.statistics['stat_agent_reward'][-1][3] = [int(actions)]
    elif self.parameters['TRANSP_AGENT_ACTION_MAPPING'] == 'resource':
        self.statistics['stat_agent_reward'][-1][3] = [int(actions[0]), int(actions[1])]
    self.statistics['stat_agent_reward'][-1][4] = round(reward, 5)
    self.statistics['stat_agent_reward'][-1][5] = agent.next_action_valid
    self.statistics['stat_agent_reward'].append([self.count_episode, self.step_counter, round(self.env.now, 5),
                                                 None, None, None, states])
    # done = truncated or terminal
    # if truncated:
    #     self.reset()
    return states, reward, terminal, truncated, info
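As a side note, the API side of the migrated environment can be checked with Gymnasium's built-in environment checker; a minimal sketch, assuming the environment class is importable as ProductionEnv (a placeholder name, not necessarily the real one):

from gymnasium.utils.env_checker import check_env

# ProductionEnv is a placeholder for the actual custom environment class
env = ProductionEnv()
check_env(env, skip_render_check=True)  # raises if step()/reset() return values or spaces are inconsistent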
The reward function is calculated like this:
def calculate_reward(self, action):
    result_reward = self.parameters['TRANSP_AGENT_REWARD_INVALID_ACTION']  # = 0.0
    result_terminal = False
    if self.invalid_counter < self.parameters['TRANSP_AGENT_MAX_INVALID_ACTIONS']:  # If true, then invalid action selected
        if self.parameters['TRANSP_AGENT_REWARD'] == "valid_action":
            result_reward = get_reward_valid_action(self, result_reward)
        elif self.parameters['TRANSP_AGENT_REWARD'] == "utilization":
            result_reward = get_reward_utilization(self, result_reward)
    else:
        self.invalid_counter = 0
        result_reward = 0.0
        # result_terminal = True
    if self.next_action_valid:
        self.invalid_counter = 0
        self.counter_action_subsets[0] += 1
        if self.next_action_destination != -1 and self.next_action_origin != -1 and self.next_action_destination.type == 'machine':
            self.counter_action_subsets[1] += 1
        elif self.next_action_destination != -1 and self.next_action_origin != -1 and self.next_action_destination.type == 'sink':
            self.counter_action_subsets[2] += 1
    # If explicit episode limits are set in configuration
    if self.parameters['TRANSP_AGENT_REWARD_EPISODE_LIMIT'] > 0:
        result_reward = 0.0
        if (self.parameters['TRANSP_AGENT_REWARD_EPISODE_LIMIT_TYPE'] == 'valid' and self.counter_action_subsets[0] == self.parameters['TRANSP_AGENT_REWARD_EPISODE_LIMIT']) or \
                (self.parameters['TRANSP_AGENT_REWARD_EPISODE_LIMIT_TYPE'] == 'entry' and self.counter_action_subsets[1] == self.parameters['TRANSP_AGENT_REWARD_EPISODE_LIMIT']) or \
                (self.parameters['TRANSP_AGENT_REWARD_EPISODE_LIMIT_TYPE'] == 'exit' and self.counter_action_subsets[2] == self.parameters['TRANSP_AGENT_REWARD_EPISODE_LIMIT']) or \
                (self.parameters['TRANSP_AGENT_REWARD_EPISODE_LIMIT_TYPE'] == 'time' and self.env.now - self.last_reward_calc_time > self.parameters['TRANSP_AGENT_REWARD_EPISODE_LIMIT']):
            result_terminal = True
            self.last_reward_calc_time = self.env.now
            self.invalid_counter = 0
            self.counter_action_subsets = [0, 0, 0]
    if result_terminal:
        if self.parameters['TRANSP_AGENT_REWARD_SPARSE'] == "utilization":
            result_reward = get_reward_sparse_utilization(self)
        elif self.parameters['TRANSP_AGENT_REWARD_SPARSE'] == "waiting_time":
            result_reward = get_reward_sparse_waiting_time(self)
        elif self.parameters['TRANSP_AGENT_REWARD_SPARSE'] == "valid_action":
            result_reward = get_reward_sparse_valid_action(self)
    else:
        self.last_reward_calc_time = self.env.now
    self.latest_reward = result_reward
    return result_reward, result_terminal
def get_reward_utilization(transport_resource, invalid_reward):
    result_reward = invalid_reward
    if transport_resource.next_action_destination == -1 or transport_resource.next_action_origin == -1:  # Waiting or empty action selected
        result_reward = transport_resource.parameters['TRANSP_AGENT_REWARD_WAITING_ACTION']  # = 0.0
    elif transport_resource.next_action_valid:
        util = 0.0
        for mach in transport_resource.resources['machines']:
            util += mach.get_utilization_step()  # Calculation of utilization of machines
        util = util / transport_resource.parameters['NUM_MACHINES']
        transport_resource.last_reward_calc = util
        result_reward = np.exp(util / 1.5) - 1.0
        if transport_resource.next_action_destination.type == 'machine':
            result_reward = transport_resource.parameters['TRANSP_AGENT_REWARD_SUBSET_WEIGHTS'][0] * result_reward  # here the weight is = 1.0
        else:
            result_reward = transport_resource.parameters['TRANSP_AGENT_REWARD_SUBSET_WEIGHTS'][1] * result_reward  # here the weight is = 1.0
    return result_reward
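Pulled out of the environment, the dense part of this reward can be tested in isolation; a self-contained sketch with made-up machine utilizations:

import numpy as np

def dense_utilization_reward(machine_utils, scale=1.5):
    # Mean machine utilization mapped through exp(util / scale) - 1, mirroring get_reward_utilization
    util = float(np.mean(machine_utils))
    return np.exp(util / scale) - 1.0

print(dense_utilization_reward([0.9, 0.8, 0.85]))  # ~0.762, high utilization -> high reward
print(dense_utilization_reward([0.1, 0.2, 0.0]))   # ~0.069, low utilization -> low reward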
The reset function looks like this:
def reset(self):
    print("####### Reset Environment #######")
    self.count_episode += 1
    self.step_counter = 0
    if self.count_episode == self.parameters['CHANGE_SCENARIO_AFTER_EPISODES']:
        self.change_production_parameters()
    print("Sim start time: ", self.statistics['sim_start_time'])
    # Setup and start simulation
    if self.env.now == 0.0:
        print('Run machine shop simpy environment')
        self.env.run(until=self.parameters['step_criteria'])
    obs = np.array(self.resources['transps'][0].calculate_state())
    info = {}
    return obs, info
I already checked the reward function and, as far as I understand it, it works as I expect. I also checked that the reward transferred to tensorboard matches the reward in my logging files. I read the post Why does ep_re_mean decrease over time?, but it did not help me. Does anyone have any idea why the mean reward per episode decreases over time? Note: I can provide more code if needed. Thanks in advance!
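For reference, the kind of independent cross-check that can be used to compare episode returns against ep_rew_mean without the SB3 logger in between; a minimal sketch, assuming make_production_env() (a placeholder) constructs the environment:

from gymnasium.wrappers import RecordEpisodeStatistics

env = RecordEpisodeStatistics(make_production_env())  # make_production_env is a placeholder
obs, info = env.reset(seed=0)
for _ in range(10_000):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        # The wrapper adds an "episode" dict with return "r" and length "l" at episode end
        print(info["episode"]["r"], info["episode"]["l"])
        obs, info = env.reset()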
EDIT: My full code can be found here: JSP_Environment
The reason for the decreasing mean episodic reward is the way I designed the observation space: if, instead of 13 different observations, I only provide the total processing time as part of the observation space, the average episodic reward increases. If I use all 13 observations, the average episodic reward decreases. Hence the design of the state space causes the problem.
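For anyone who wants to reproduce that comparison: the reduced state space can be tested with a small ObservationWrapper that keeps only a subset of a flat observation vector; a sketch, where the feature indices are placeholders and depend on how the 13 observations are ordered:

import gymnasium as gym
import numpy as np

class SelectFeatures(gym.ObservationWrapper):
    # Keep only selected entries of a flat Box observation (indices are illustrative)
    def __init__(self, env, indices):
        super().__init__(env)
        self.indices = np.asarray(indices)
        self.observation_space = gym.spaces.Box(
            low=env.observation_space.low[self.indices],
            high=env.observation_space.high[self.indices],
            dtype=env.observation_space.dtype)

    def observation(self, observation):
        return np.asarray(observation)[self.indices]

# e.g. keep only the total-processing-time feature (index 0 is an assumption):
# env = SelectFeatures(make_production_env(), indices=[0])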