I am currently trying to learn about reinforcement learning (RL). I am quite new to the field, and I apologize for the wall of text.
I have encountered many examples of RL using TensorFlow, Keras, keras-rl, stable-baselines3, PyTorch, gym, etc. However, I have noticed an oddity in the example code that I do not understand, and I need some guidance.
The oddity is in the use of gym's observation spaces. In many examples, the custom environment initializes a gym observation space, yet that observation space never actually seems to be used. The environment state is often kept in a separate variable, which is then updated based on the action before the reward is calculated.
My questions:
Why do we define the observation space if we do not use it? Furthermore, we cannot change the observation space; we cannot write observation_space[i] = 1, for example. Is it strictly necessary to have gym's observation space? Is it used somewhere through the inheritance from gym's Env class?
The same goes for the action space. Is it strictly necessary to use gym's spaces, or can you just use, e.g., an array = [0, 1, 2]?
A lot of work has clearly gone into creating spaces such as Discrete, Box, and MultiDiscrete, and I encounter them repeatedly in tutorials, so they must serve a purpose. I have tried looking at the documentation and other resources online, but I struggle to find the information I am missing.
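To make my question concrete, here is a minimal sketch of my own (not from any tutorial) of what the space objects themselves seem to offer, using only the standard gym.spaces methods I am aware of (sample, contains, shape, n). As far as I can tell, a Space describes which values are valid rather than storing the current state:

import numpy as np
from gym.spaces import Discrete, Box, MultiDiscrete

# A Space describes valid values; it does not hold the environment's state.
action_space = Discrete(3)                              # valid actions: 0, 1, 2
observation_space = Box(low=np.array([25.0]), high=np.array([50.0]))
multi = MultiDiscrete([5, 2, 2])                        # three discrete sub-variables

# You can draw random valid values from a space ...
print(action_space.sample())                            # e.g. 2
print(observation_space.sample())                       # e.g. array([37.3], dtype=float32)

# ... check whether a given value is valid ...
print(observation_space.contains(np.array([38.0], dtype=np.float32)))  # True
print(observation_space.contains(np.array([99.0], dtype=np.float32)))  # False

# ... and read the sizes that agent libraries use to build their networks.
print(observation_space.shape)                          # (1,)
print(action_space.n)                                   # 3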
An example:
The examples often use a custom agent and a custom network with a pre-built environment (e.g., CartPole), or they build a custom environment and train it with an already implemented algorithm like A2C, A3C, or PPO. It is therefore difficult to find examples that cover both sides of the RL framework.
I have the following code from an example YouTube video: https://www.youtube.com/watch?v=bD6V3rcr_54. The code regulates the temperature of a shower, and the agent learns whether the temperature should go up or down. In the environment, the observation space is created as a Box, but the state is handled without ever using that space. As the agent interacts with the environment, I cannot see the observation space change anywhere. Actions are generated by the neural network, which takes the state as input, not the observation space, and the state is updated through the state variable, not the observation space. The same problem exists for the action space. Both spaces just seem to exist; as far as I can tell, they are only used to specify the input and output shapes of the neural network.
import numpy as np
from gym import Env
from gym.spaces import Discrete, Box
import random
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers.legacy import Adam
from rl.agents import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory
class ShowerEnv(Env):
    def __init__(self):
        # Action space: 0 = turn the temperature down, 1 = hold, 2 = turn it up
        self.action_space = Discrete(3)
        # Observation space: the shower temperature, declared as a 1-D Box
        self.observation_space = Box(low=np.array([25]), high=np.array([50]))
        # The actual state lives in a plain attribute, separate from the space
        self.state = 38 + random.randint(-3, 3)
        self.shower_length = 60

    def step(self, action):
        # Map action {0, 1, 2} to a temperature change of {-1, 0, +1}
        self.state += action - 1
        self.shower_length -= 1
        # Reward +1 while the temperature stays in the comfortable 37-39 band
        if 37 <= self.state <= 39:
            reward = 1
        else:
            reward = -1
        # The episode ends when the shower time runs out
        done = self.shower_length <= 0
        # Add a small random temperature drift
        self.state += random.uniform(-0.1, 0.1)
        info = {}
        return self.state, reward, done, info

    def render(self):
        pass

    def reset(self):
        self.state = 38 + random.randint(-3, 3)
        self.shower_length = 60
        return self.state
def build_model(states, actions):
    model = Sequential()
    # keras-rl feeds observations with an extra window_length dimension,
    # so flatten (window_length,) + observation shape before the Dense layers
    model.add(Flatten(input_shape=(1,) + states))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(actions, activation='linear'))
    return model

def build_agent(model, actions):
    policy = BoltzmannQPolicy()
    memory = SequentialMemory(limit=50000, window_length=1)
    dqn = DQNAgent(model=model, memory=memory, policy=policy, nb_actions=actions,
                   nb_steps_warmup=100, target_model_update=1e-2)
    return dqn
if __name__ == '__main__':
    # Set up the environment
    env = ShowerEnv()
    # The spaces are read only here, to size the network's input and output
    states = env.observation_space.shape
    actions = env.action_space.n

    # Build and train the DQN agent
    model = build_model(states, actions)
    dqn = build_agent(model, actions)
    dqn.compile(Adam(learning_rate=1e-3), metrics=["mae"])
    dqn.fit(env, nb_steps=150000, visualize=False, verbose=1)

    # Test the trained agent
    scores = dqn.test(env, nb_episodes=100, visualize=False)
    print(np.mean(scores.history["episode_reward"]))
Building new environment interfaces from scratch every time is not really ideal; it's scutwork. Sticking to the gym standard will save you tonnes of repetitive work.
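As a rough sketch of that payoff (my own illustration, assuming a stable-baselines3 version compatible with the gym/gymnasium version you have installed, and assuming the ShowerEnv above is adjusted so that reset() and step() return observations that actually lie inside its Box, e.g. a shape-(1,) array): an environment that declares its spaces and respects them can be validated and trained by an off-the-shelf library with no extra glue code.

from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env

env = ShowerEnv()

# check_env enforces exactly the contract the spaces declare: it raises an
# error if reset()/step() return observations that do not lie in
# observation_space, or if the spaces are missing entirely.
check_env(env)

# The library reads observation_space and action_space to size the policy
# network's inputs and outputs; you never pass those shapes in by hand.
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)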