Tags: python, openai-gym, q-learning

Q-table not updating in FrozenLake-v1 environment using Q-learning


I'm implementing Q-learning for the FrozenLake-v1 environment in OpenAI Gym. However, my Q-table doesn't update during training; it remains filled with zeros. I've reviewed my code multiple times, but I can't pinpoint the issue.

Here's the code I'm using:

import gymnasium as gym
import numpy as np
import random


def run():
    env = gym.make("FrozenLake-v1") # setup env
    Q = np.zeros((env.observation_space.n, env.action_space.n)) # empty q_table

    alpha = 0.7            # learning rate
    gamma = 0.95           # discount factor
    epsilon = 0.9          # initial exploration rate
    epsilon_decay = 0.005  # subtracted from epsilon on every step
    epsilon_min = 0.01
    episode = 0
    episodes = 10000

    state, info = env.reset()

    print("Before training")
    print(Q)

    while episode < episodes:

        if epsilon > epsilon_min:
            epsilon -= epsilon_decay
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])

        new_state, reward, terminated, truncated, info = env.step(action)

        # Q-learning update
        Q[state, action] = Q[state, action] + alpha * (float(reward) + gamma * np.max(Q[new_state]) - Q[state, action])

        state = new_state

        if terminated or truncated:
            episode += 1
            state, info = env.reset()  # Reset the environment

    print("After training")
    print(Q)
    env.close()


run()

I suspect the issue might be related to how I'm updating the Q-table or handling the environment states. Any help in identifying and resolving the problem would be greatly appreciated.

I added print statements to display intermediate values (the selected actions, rewards, and the Q-table itself) during training, to check whether the values were updating as expected. I also tried training the agent with a smaller number of episodes to simplify the problem, but even then the Q-table remained filled with zeros. Finally, I revisited the Q-table update formula to make sure it aligns with the Q-learning algorithm; the formula seems correct, yet the issue persists.

I expected the Q-table to gradually update during training, reflecting the agent's learned values for state-action pairs. However, the Q-table remains unchanged, filled with zeros even after running the training loop for the specified number of episodes.


Solution

  • The issue is due to a combination of two problems:

    If there are multiple maximum values in an array, np.argmax will return the first index at which the maximum value occurs. Initially, all values in the Q-table are 0, so whenever you are taking an exploitation step, you will take the first action, which in this case is 'move left'.
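
    For example (a minimal illustration, with numpy imported as np as in the question; FrozenLake's four discrete actions are numbered 0-3, and action 0 means 'move left'):

    q_row = np.zeros(4)  # a fresh, all-zero row of the Q-table
    np.argmax(q_row)     # returns 0, the first index of the maximum, i.e. 'move left'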

    Except for reaching the goal state, all rewards are zero, so the Q-table only starts to contain non-zero values after the agent first reaches the goal (and receives a reward of 1). That is very unlikely to happen in the first few hundred episodes: because your code decreases epsilon on every step, it drops from 0.9 to 0.01 after fewer than 200 steps ((0.9 - 0.01) / 0.005 ≈ 178), so almost from the start you are taking exploitation steps (i.e. moving left), collecting rewards of 0, and never making a meaningful update to the Q-table.

    Instead of np.argmax, I suggest using the following function, which returns a random index at which the maximum value occurs:

    def argmax(arr):
        # pick a random index among all positions holding the maximum value
        arr_max = np.max(arr)
        return np.random.choice(np.where(arr == arr_max)[0])
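
    In the question's loop, the exploitation branch then calls this helper instead of np.argmax (same variable names as in the question):

    action = argmax(Q[state])

    Because np.random.choice picks uniformly among the tied maxima, early exploitation steps no longer collapse onto 'move left'.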
    

    Also, these hyperparameters for epsilon are more sensible. With this schedule, epsilon reaches its minimum at around the halfway point of training, assuming you decay it once per completed episode (the original code decays it on every step):

    epsilon = 1
    epsilon_decay = (2 * epsilon) / episodes
    epsilon_min = 0.001
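
    Putting both suggestions together, the loop from the question could look roughly like this (a sketch built on the question's variable names, not a drop-in replacement; decaying epsilon once per completed episode is my assumption, chosen so that the roughly episodes / 2 decrements implied by epsilon_decay line up with "half of the training"):

    while episode < episodes:
        # epsilon-greedy action selection, now with random tie-breaking
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = argmax(Q[state])

        new_state, reward, terminated, truncated, info = env.step(action)

        # standard Q-learning update (unchanged from the question)
        Q[state, action] += alpha * (float(reward) + gamma * np.max(Q[new_state]) - Q[state, action])
        state = new_state

        if terminated or truncated:
            episode += 1
            state, info = env.reset()
            # decay epsilon once per completed episode (assumed placement)
            epsilon = max(epsilon - epsilon_decay, epsilon_min)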