I'm trying to implement a simple Q-learning algorithm, but for whatever reason it doesn't converge. The agent basically has to get from one point on the 5x5 grid to the goal one. When I run it, it seems to find the optimal path, yet it doesn't converge and I can't figure out why. Any help would be appreciated. I feel like there is one little mistake somewhere, so I'm looking for a fresh set of eyes.
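For reference, the update rule I'm trying to implement is (as far as I understand it) the standard tabular Q-learning update:

Q(s,a) <- Q(s,a) + alpha * ( R(s,a) + gamma * max_a' Q(s',a') - Q(s,a) )

where alpha is the learning rate and gamma is the discount factor; in my setup the chosen action also doubles as the next state s', as in the classic rooms example.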
Code:
function Q=ReinforcementLearning
clc;
format short
format compact
% three parameters: R, alpha and gamma
% immediate reward matrix;
% rows and columns = states; -Inf = no door between rooms
R = [-inf, -inf,    0,    0, -inf;
     -inf, -inf,    0,    0, -inf;
        0,    0, -inf, -inf,  100;
        0,    0, -inf, -inf, -inf;
     -inf, -inf,    0, -inf,  100];
gamma=0.8; % discount factor
alpha=0.5; % learning rate
oldQ = 0;
Q=zeros(size(R)); % initialize Q as zero
q1=ones(size(R))*inf; % initialize previous Q as big number
count=0; % counter
for episode=0:50000
    % random initial state
    y=randperm(size(R,1));
    state=y(1);
    % select any action from this state
    x=find(R(state,:)>=0); % find possible actions from this state
    if ~isempty(x)
        x1=RandomPermutation(x); % randomize the possible actions
        x1=x1(1); % select an action
    end
    MaxQ=max(Q,[],2);
    %Q(state,x1) = R(state,x1) + (gamma * MaxQ(x1)); % old update that works perfectly (converges)
    Q(state,x1) = oldQ + alpha * (R(state,x1) + (gamma * MaxQ(x1)) - oldQ); % new one that I need to implement
    oldQ = Q(state,x1);
    state=x1; %#ok<NASGU>
    %Q = round(Q);
    % break if convergence: small deviation on Q for 1000 consecutive episodes
    if sum(sum(abs(q1-Q)))<5 && sum(sum(Q > 0))
        if count>1000
            episode % report last episode
            break % for
        else
            count=count+1; % increment counter while deviation of Q stays small
        end
    else
        q1=Q;
        count=0; % reset counter when deviation of Q from previous Q is large
    end
end
% normalize Q
g=max(max(Q));
episode
if g>0
    Q=100*Q/g;
    roundQ=round(Q);
    roundQ
end
EDIT:
The problem is here:
Q(state,x1)= oldQ + alpha * (R(state,x1)+ (gamma * MaxQ(x1)) - oldQ);
You are updating each Q(state,x1) with the scalar oldQ, whereas it should be updated from its own previous value, i.e. from Q(state,x1) as it was before taking this action. Do you see it? It is better to have a matrix for Q and save the previous values so you can use them here; as it stands, the same oldQ is being reused for all the **Q**s.
Actually, you can load the previous values from q1 and use them instead of oldQ.
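In other words, something along these lines should converge (just a sketch of the fix for the update line, dropping the scalar oldQ bookkeeping and keeping the rest of the loop as it is):

% the matrix entry already holds the previous estimate for this state-action pair,
% so use it as the "old" value instead of the shared scalar oldQ
Q(state,x1) = Q(state,x1) + alpha * (R(state,x1) + gamma * MaxQ(x1) - Q(state,x1));

With that change, the oldQ = 0 initialization and the oldQ = Q(state,x1) bookkeeping line are no longer needed.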