I'm having trouble understanding the CartPole code used as an example for Deep Q-Learning. The DQLAgent part of the code is as follows:
import random
from collections import deque

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam


class DQLAgent:
    def __init__(self, env):
        # parameters / hyperparameters
        self.env = env
        self.state_size = env.observation_space.shape[0]
        self.action_size = env.action_space.n
        self.gamma = 0.95
        self.learning_rate = 0.001
        self.epsilon = 1.0          # exploration rate
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.memory = deque(maxlen=1000)
        self.model = self.build_model()

    def build_model(self):
        # neural network that approximates the Q function
        model = Sequential()
        model.add(Dense(48, input_dim=self.state_size, activation="tanh"))
        model.add(Dense(self.action_size, activation="linear"))
        model.compile(loss="mse", optimizer=Adam(learning_rate=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        # store a transition in the replay buffer
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        # epsilon-greedy action selection: explore or exploit
        if random.uniform(0, 1) <= self.epsilon:
            return self.env.action_space.sample()
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])

    def replay(self, batch_size):
        # train on a random minibatch of stored transitions
        if len(self.memory) < batch_size:
            return
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            if done:
                target = reward
            else:
                target = reward + self.gamma * np.amax(self.model.predict(next_state)[0])
            train_target = self.model.predict(state)
            train_target[0][action] = target
            self.model.fit(state, train_target, verbose=0)

    def adaptiveEGreedy(self):
        # decay epsilon toward epsilon_min
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
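For context, I drive the agent with a training loop roughly like this (a minimal sketch assuming the classic gym API for CartPole-v1, where reset() returns just the state and step() returns four values; the loop itself is not what I'm asking about):

import gym

env = gym.make("CartPole-v1")
agent = DQLAgent(env)
batch_size = 16

for episode in range(100):
    state = env.reset()
    state = np.reshape(state, [1, agent.state_size])
    while True:
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        next_state = np.reshape(next_state, [1, agent.state_size])
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        agent.replay(batch_size)
        agent.adaptiveEGreedy()
        if done:
            break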
In the training section (replay), we compute target and train_target. So why do we then set train_target[0][action] = target?
Every prediction made while learning is imperfect, and thanks to the error calculation and backpropagation the network's predictions should get closer and closer to the true values. But when we set train_target[0][action] = target, doesn't the error become 0? In that case, how does any learning happen?
self.model.predict(state)
will return an array of shape (1, 2) containing the estimated Q values for each action (in CartPole the action space is {0, 1}).
As you know, a Q value is an estimate of the expected (discounted) sum of future rewards.
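For instance, using the names from your code (the numbers below are made up, and I am assuming the classic gym API):

state = np.reshape(env.reset(), [1, 4])   # CartPole observation has 4 features
q_values = agent.model.predict(state)     # e.g. array([[0.12, -0.03]]), shape (1, 2)
best_action = np.argmax(q_values[0])      # index of the larger Q value, here 0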
By setting train_target[0][action] = target
(where train_target is a copy of the model's current prediction and target is the estimated return) it is creating a target Q value on which to train the model. Then calling model.fit(state, train_target)
uses that target Q value to train the model to approximate better Q values for each state.
I don't understand why you are saying that the loss becomes 0: the target is set to the current reward plus the discounted estimate of future rewards
target = reward + self.gamma*np.amax(self.model.predict(next_state)[0])
while the network's current prediction for the chosen action is
self.model.predict(state)[0][action]
The loss between the target and this prediction is what is used to train the model.
(you can ignore the [0] on the predicted values; it just selects the single row of the batch and is unimportant for the understanding)
The target variable is set to the sum of the current reward and the estimated (discounted) sum of future rewards, i.e. a Q value. Note that although this variable is called target, it is not the target of the network; it is the target Q value for the chosen action only.
The train_target variable is what is actually passed to fit (what you might call the "dataset"). It represents the target output of the network.
train_target = self.model.predict(state)
train_target[0][action] = target
You can clearly see that:
train_target[<taken action>] = reward + self.gamma*np.amax(self.model.predict(next_state)[0])
train_target[<any other action>] = <prediction from the model>
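A concrete example, reusing the names from your code (the numbers are made up purely for illustration):

prediction = agent.model.predict(state)        # e.g. array([[0.50, 0.30]])
train_target = prediction.copy()               # start from the model's own output
action, reward = 1, 1.0                        # suppose action 1 was taken
target = reward + agent.gamma * np.amax(agent.model.predict(next_state)[0])  # e.g. 1.40
train_target[0][action] = target               # train_target is now array([[0.50, 1.40]])
# entry 0 is untouched (zero error); entry 1 carries a non-zero error of 1.40 - 0.30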
The loss (mean squared error) is
prediction = self.model.predict(state)
loss = (train_target - prediction)^2
For any entry of train_target that was not overwritten (every action other than the one taken), the loss is 0. For the one entry that has been set, the loss is
(target - prediction[action])^2
or
((reward + self.gamma*np.amax(self.model.predict(next_state)[0])) - self.model.predict(state)[0][action])^2
which is clearly different from 0.
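Continuing the made-up numbers from above:

prediction = np.array([[0.50, 0.30]])
train_target = np.array([[0.50, 1.40]])
squared_error = (train_target - prediction) ** 2   # array([[0.  , 1.21]])
mse = squared_error.mean()                         # 0.605, clearly non-zero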
Note that this agent is not ideal. I would strongly recommend using a separate target model instead of creating target Q values this way. Check out this answer as to why.
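As a rough sketch of what I mean (the class and parameter names here are my own, not from any library; the target model is just a periodically synced copy of the online model that is used only to compute the bootstrap targets):

class DQLAgentWithTarget(DQLAgent):
    def __init__(self, env, update_every=50):
        super().__init__(env)
        self.target_model = self.build_model()
        self.target_model.set_weights(self.model.get_weights())
        self.update_every = update_every
        self.step_count = 0

    def replay(self, batch_size):
        if len(self.memory) < batch_size:
            return
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            if done:
                target = reward
            else:
                # bootstrap from the frozen target network, not the online one
                target = reward + self.gamma * np.amax(self.target_model.predict(next_state)[0])
            train_target = self.model.predict(state)
            train_target[0][action] = target
            self.model.fit(state, train_target, verbose=0)
        # periodically sync the target network with the online network
        self.step_count += 1
        if self.step_count % self.update_every == 0:
            self.target_model.set_weights(self.model.get_weights())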