I'm trying to implement a simple Q-learning algorithm, but for whatever reason it doesn't converge. The agent basically has to get from one point on the 5x5 grid to the goal one. When I run it, it seems to find the optimal path, yet it doesn't converge and I can't figure out why. Any help would be appreciated. I feel like there is one little mistake somewhere, so I'm looking for a fresh set of eyes.
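For reference, the update rule I'm trying to implement is (as far as I understand it) the standard tabular Q-learning update:

Q(s,a) <- Q(s,a) + alpha * ( R(s,a) + gamma * max_a' Q(s',a') - Q(s,a) )

where alpha is the learning rate and gamma is the discount factor; in my setup the chosen action also doubles as the next state s', as in the classic rooms example.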
Code:
function Q=ReinforcementLearning
clc;
format short
format compact
% three parameters: R, alpha and gamma
% immediate reward matrix;
% rows and columns = states; -Inf = no door between rooms
R = [-inf, -inf,    0,    0, -inf;
     -inf, -inf,    0,    0, -inf;
        0,    0, -inf, -inf,  100;
        0,    0, -inf, -inf, -inf;
     -inf, -inf,    0, -inf,  100];
gamma=0.8; % discount factor
alpha=0.5; % learning rate
oldQ = 0;
Q=zeros(size(R)); % initialize Q as zero
q1=ones(size(R))*inf; % initialize previous Q as big number
count=0; % counter
for episode=0:50000
    % random initial state
    y=randperm(size(R,1));
    state=y(1);
    % select any action from this state
    x=find(R(state,:)>=0); % find possible actions from this state
    if ~isempty(x)
        x1=RandomPermutation(x); % randomize the possible actions
        x1=x1(1); % select an action
    end
    MaxQ=max(Q,[],2);
    %Q(state,x1) = R(state,x1) + (gamma * MaxQ(x1)); % old update that works perfectly (converges)
    Q(state,x1) = oldQ + alpha * (R(state,x1) + (gamma * MaxQ(x1)) - oldQ); % new one that I need to implement
    oldQ = Q(state,x1);
    state=x1; %#ok<NASGU>
    %Q = round(Q);
    % break if convergence: small deviation on Q for 1000 consecutive episodes
    if sum(sum(abs(q1-Q)))<5 && sum(sum(Q > 0))
        if count>1000
            episode % report last episode
            break % for
        else
            count=count+1; % increment counter while deviation of Q stays small
        end
    else
        q1=Q;
        count=0; % reset counter when deviation of Q from previous Q is large
    end
end
% normalize Q
g=max(max(Q));
episode
if g>0
    Q=100*Q/g;
    roundQ=round(Q);
    roundQ
end
EDIT:
The problem is here:
Q(state,x1)= oldQ + alpha * (R(state,x1)+ (gamma * MaxQ(x1)) - oldQ);
You are updating each Q(state,x1) with the scalar oldQ, whereas it should be updated from its own previous value, i.e. from Q(state,x1) as it was before taking this action. Do you see it? It is better to have a matrix for Q and save the previous values so you can use them here; as it stands, the same oldQ is being reused for all the **Q**s.
Actually, you can load the previous values from q1 and use them instead of oldQ.
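In other words, something along these lines should converge (just a sketch of the fix for the update line, dropping the scalar oldQ bookkeeping and keeping the rest of the loop as it is):

% the matrix entry already holds the previous estimate for this state-action pair,
% so use it as the "old" value instead of the shared scalar oldQ
Q(state,x1) = Q(state,x1) + alpha * (R(state,x1) + gamma * MaxQ(x1) - Q(state,x1));

With that change, the oldQ = 0 initialization and the oldQ = Q(state,x1) bookkeeping line are no longer needed.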