I am currently trying to learn about reinforcement learning (RL). I am quite new to the field, and I apologize for the wall of text.
I have encountered many examples of RL using TensorFlow, Keras, keras-rl, stable-baselines3, PyTorch, gym, etc. However, I have noticed an oddity in the example code that I do not understand, and I need some guidance.
The oddity is in the use of gym's observation spaces. In many examples, the custom environment initializes a gym observation space, yet that observation space never actually seems to be used. The environment state is often kept in a separate variable, which is then updated based on the action before the reward is calculated.
My questions:
Why do we define the observation space if we do not use it? Furthermore, we cannot change the observation space; we cannot write observation_space[i] = 1, for example. Is it strictly necessary to have gym's observation space? Is it used somewhere through the inheritance from gym's Env class?
The same goes for the action space. Is it strictly necessary to use gym's spaces, or can you just use, e.g., an array = [0, 1, 2]?
A lot of work has clearly gone into creating spaces such as Discrete, Box, and MultiDiscrete, and I encounter them repeatedly in tutorials, so they must serve a purpose. I have tried looking at the documentation and other resources online, but I struggle to find the information I am missing.
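To make my question concrete, here is a minimal sketch of my own (not from any tutorial) of what the space objects themselves seem to offer, using only the standard gym.spaces methods I am aware of (sample, contains, shape, n). As far as I can tell, a Space describes which values are valid rather than storing the current state:

import numpy as np
from gym.spaces import Discrete, Box, MultiDiscrete

# A Space describes valid values; it does not hold the environment's state.
action_space = Discrete(3)                              # valid actions: 0, 1, 2
observation_space = Box(low=np.array([25.0]), high=np.array([50.0]))
multi = MultiDiscrete([5, 2, 2])                        # three discrete sub-variables

# You can draw random valid values from a space ...
print(action_space.sample())                            # e.g. 2
print(observation_space.sample())                       # e.g. array([37.3], dtype=float32)

# ... check whether a given value is valid ...
print(observation_space.contains(np.array([38.0], dtype=np.float32)))  # True
print(observation_space.contains(np.array([99.0], dtype=np.float32)))  # False

# ... and read the sizes that agent libraries use to build their networks.
print(observation_space.shape)                          # (1,)
print(action_space.n)                                   # 3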
An example:
The examples often use a custom agent and a custom network with a pre-built environment (e.g., CartPole), or they build a custom environment and train it with an already implemented algorithm like A2C, A3C, or PPO. It is therefore difficult to find examples that cover both sides of the RL framework.
I have the following code from an example YouTube video: https://www.youtube.com/watch?v=bD6V3rcr_54. The code regulates the temperature of a shower, and the agent learns whether the temperature should go up or down. In the environment, the observation space is created as a Box, but the state is handled without ever using that space. As the agent interacts with the environment, I cannot see the observation space change anywhere. Actions are generated by the neural network, which takes the state as input, not the observation space, and the state is updated through the state variable, not the observation space. The same problem exists for the action space. Both spaces just seem to exist; as far as I can tell, they are only used to specify the input and output shapes of the neural network.
import numpy as np
from gym import Env
from gym.spaces import Discrete, Box
import random
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers.legacy import Adam
from rl.agents import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory
class ShowerEnv(Env):
    def __init__(self):
        # Action space: 0 = turn the temperature down, 1 = hold, 2 = turn it up
        self.action_space = Discrete(3)
        # Observation space: the shower temperature, declared as a 1-D Box
        self.observation_space = Box(low=np.array([25]), high=np.array([50]))
        # The actual state lives in a plain attribute, separate from the space
        self.state = 38 + random.randint(-3, 3)
        self.shower_length = 60

    def step(self, action):
        # Map action {0, 1, 2} to a temperature change of {-1, 0, +1}
        self.state += action - 1
        self.shower_length -= 1
        # Reward +1 while the temperature stays in the comfortable 37-39 band
        if 37 <= self.state <= 39:
            reward = 1
        else:
            reward = -1
        # The episode ends when the shower time runs out
        done = self.shower_length <= 0
        # Add a small random temperature drift
        self.state += random.uniform(-0.1, 0.1)
        info = {}
        return self.state, reward, done, info

    def render(self):
        pass

    def reset(self):
        self.state = 38 + random.randint(-3, 3)
        self.shower_length = 60
        return self.state
def build_model(states, actions):
    model = Sequential()
    # keras-rl feeds observations with an extra window_length dimension,
    # so flatten (window_length,) + observation shape before the Dense layers
    model.add(Flatten(input_shape=(1,) + states))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(actions, activation='linear'))
    return model

def build_agent(model, actions):
    policy = BoltzmannQPolicy()
    memory = SequentialMemory(limit=50000, window_length=1)
    dqn = DQNAgent(model=model, memory=memory, policy=policy, nb_actions=actions,
                   nb_steps_warmup=100, target_model_update=1e-2)
    return dqn
if __name__ == '__main__':
    # Set up the environment
    env = ShowerEnv()
    # The spaces are read only here, to size the network's input and output
    states = env.observation_space.shape
    actions = env.action_space.n

    # Build and train the DQN agent
    model = build_model(states, actions)
    dqn = build_agent(model, actions)
    dqn.compile(Adam(learning_rate=1e-3), metrics=["mae"])
    dqn.fit(env, nb_steps=150000, visualize=False, verbose=1)

    # Test the trained agent
    scores = dqn.test(env, nb_episodes=100, visualize=False)
    print(np.mean(scores.history["episode_reward"]))
Building new environment interfaces from scratch every time is not really ideal; it's scutwork. Sticking to the gym standard will save you tonnes of repetitive work.
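As a rough sketch of that payoff (my own illustration, assuming a stable-baselines3 version compatible with the gym/gymnasium version you have installed, and assuming the ShowerEnv above is adjusted so that reset() and step() return observations that actually lie inside its Box, e.g. a shape-(1,) array): an environment that declares its spaces and respects them can be validated and trained by an off-the-shelf library with no extra glue code.

from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env

env = ShowerEnv()

# check_env enforces exactly the contract the spaces declare: it raises an
# error if reset()/step() return observations that do not lie in
# observation_space, or if the spaces are missing entirely.
check_env(env)

# The library reads observation_space and action_space to size the policy
# network's inputs and outputs; you never pass those shapes in by hand.
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)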