Tags: reinforcement-learning, dqn

Deep Q Learning - Cartpole Environment


I'm having trouble understanding part of the CartPole code used as an example for Deep Q Learning. The DQLAgent part of the code is as follows:

from collections import deque
import random

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam


class DQLAgent:
    def __init__(self, env):
        # parameters / hyperparameters
        self.env = env
        self.state_size = env.observation_space.shape[0]
        self.action_size = env.action_space.n

        self.gamma = 0.95
        self.learning_rate = 0.001

        self.epsilon = 1  # initial exploration rate
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01

        self.memory = deque(maxlen=1000)

        self.model = self.build_model()

    def build_model(self):
        # neural network for deep q learning
        model = Sequential()
        model.add(Dense(48, input_dim=self.state_size, activation="tanh"))
        model.add(Dense(self.action_size, activation="linear"))
        model.compile(loss="mse", optimizer=Adam(lr=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        # storage: keep the transition in the replay memory
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        # acting: explore or exploit
        if random.uniform(0, 1) <= self.epsilon:
            return self.env.action_space.sample()
        else:
            act_values = self.model.predict(state)
            return np.argmax(act_values[0])

    def replay(self, batch_size):
        # training on a random minibatch of stored transitions
        if len(self.memory) < batch_size:
            return
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            if done:
                target = reward
            else:
                target = reward + self.gamma*np.amax(self.model.predict(next_state)[0])
            train_target = self.model.predict(state)
            train_target[0][action] = target
            self.model.fit(state, train_target, verbose=0)

    def adaptiveEGreedy(self):
        # decay epsilon towards epsilon_min
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
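
For reference, the agent is driven by a standard episode loop of roughly the following form (the episode count and batch size here are placeholders, and the classic gym API is assumed):

import gym
import numpy as np

env = gym.make("CartPole-v1")
agent = DQLAgent(env)
batch_size = 16

for episode in range(100):
    state = np.reshape(env.reset(), [1, agent.state_size])
    done = False
    while not done:
        action = agent.act(state)                        # epsilon-greedy action
        next_state, reward, done, _ = env.step(action)
        next_state = np.reshape(next_state, [1, agent.state_size])
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        agent.replay(batch_size)                         # train on a minibatch
        agent.adaptiveEGreedy()                          # decay epsilon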

In the training section (the replay method), we compute target and train_target. So why do we set train_target[0][action] = target here?

No prediction made during learning is exact, but thanks to the error calculation and backpropagation the network's predictions get closer and closer to the true values. But when we set train_target[0][action] = target, doesn't the error become 0? In that case, how does any learning happen?


Solution

  • self.model.predict(state) will return a tensor of shape (1, 2) containing the estimated Q values for each action (in CartPole the action space is {0, 1}). As you know, the Q value is an estimate of the expected sum of future rewards for taking that action.

    By setting train_target[0][action] = target (where target is the expected sum of rewards for the chosen action) it creates a target Q value on which to train the model. By then calling model.fit(state, train_target) it uses that target Q value to train the model to approximate better Q values for each state.
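
    To make the shapes concrete, here is a small standalone sketch with made-up Q values (the numbers are purely illustrative):

    import numpy as np

    # hypothetical output of self.model.predict(state) for CartPole: shape (1, 2)
    train_target = np.array([[0.42, 0.57]])  # current Q estimates for actions 0 and 1

    action = 1     # the action actually taken in this transition
    target = 0.91  # reward + gamma * max Q(next_state), as computed in replay()

    train_target[0][action] = target  # only the entry for the taken action changes
    print(train_target)               # [[0.42 0.91]]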

    I don't understand why you are saying that the loss becomes 0: the target is set to the current reward plus the discounted estimate of future rewards

    target = reward + self.gamma*np.amax(self.model.predict(next_state)[0])
    

    while the network's current prediction for that action, which is what the loss compares it against, is

    self.model.predict(state)[0][action]
    

    The loss between the target and the predicted values is what is used to train the model.

    Edit - more detailed explanation

    (you can ignore the [0] on the predicted values; it just selects the single row of the batch and is not important for the understanding)

    The target variable is set to the sum of the current reward and the discounted estimate of future rewards, i.e. the Q value. Note that although this variable is called target, it is not the target output of the network; it is the target Q value for the chosen action.

    The train_target variable is used as what you call the "dataset". It represents the target output of the network.

    train_target = self.model.predict(state)
    train_target[0][action] = target
    

    You can clearly see this by writing out the loss (mean squared error):

    prediction = self.model.predict(state)
    loss = (train_target - prediction)^2
    

    For every entry of train_target other than the chosen action, the loss is 0, because target and prediction are identical there. For the one entry that has been set, the loss is

    (target - prediction[action])^2
    

    or

    ((reward + self.gamma*np.amax(self.model.predict(next_state)[0])) - self.model.predict(state)[0][action])^2
    

    which in general is not 0.
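
    As a quick numeric sanity check, reusing the made-up numbers from the sketch above:

    import numpy as np

    prediction = np.array([[0.42, 0.57]])  # self.model.predict(state)
    target = 0.91                          # reward + gamma * max Q(next_state)
    action = 1

    train_target = prediction.copy()
    train_target[0][action] = target

    loss_per_entry = (train_target - prediction) ** 2
    print(loss_per_entry)  # [[0.     0.1156]] -> 0 for the untouched action, non-zero for the trained one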


    Note that this agent is not ideal. I would strongly recommend using a separate target model instead of creating the target Q values this way. Check out this answer for why.
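
    A minimal sketch of what that change could look like, applied to the replay method above (the target_model attribute, the update_target_model method and the sync frequency are assumptions, not part of the original code):

    # in __init__: build a second network with the same architecture
    # self.target_model = self.build_model()
    # self.target_model.set_weights(self.model.get_weights())

    def replay(self, batch_size):
        if len(self.memory) < batch_size:
            return
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            if done:
                target = reward
            else:
                # bootstrap from the frozen target network instead of the online one
                target = reward + self.gamma*np.amax(self.target_model.predict(next_state)[0])
            train_target = self.model.predict(state)
            train_target[0][action] = target
            self.model.fit(state, train_target, verbose=0)

    def update_target_model(self):
        # call this every few episodes to copy the online weights into the target network
        self.target_model.set_weights(self.model.get_weights())

    This keeps the bootstrapped targets stable between syncs, which is the main point of using a target network.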