I am reading Sutton & Barto's "Reinforcement Learning: An Introduction" and trying to test the gradient-bandit agent (chapter 2.7), but its performance is extremely low. I've tried various tweaks, and nothing helps.
This is my Python code for a single life step of an agent, i.e. action selection + parameter update (self is the agent):
# for probabilities calculation:
pref_exps = np.exp(self.params["preferences"])
pref_exps_sum = sum(pref_exps)

# choosing a bandit:
choice_dice = np.random.uniform() * pref_exps_sum
accum_pref_exp = 0
for i, pref_exp in enumerate(pref_exps):
    accum_pref_exp += pref_exp
    if accum_pref_exp >= choice_dice:
        self.chosen_bandit_i = i
        break

# self.reward is filled here:
self.perform_bandit(self.chosen_bandit_i)

# updating baseline:
self.params["lifetime"] += 1
self.params["average_reward"] += 1 / self.params["lifetime"] * (self.reward - self.params["average_reward"])

# updating preferences:
for i, pref_exp in enumerate(pref_exps):
    probability = pref_exp / pref_exps_sum
    if i == self.chosen_bandit_i:
        self.params["preferences"][i] += self.params["alpha"] * (self.reward - self.params["average_reward"]) * (1 - probability)
    else:
        self.params["preferences"][i] -= self.params["alpha"] * (self.reward - self.params["average_reward"]) * probability
The life-step code above leads to extremely poor performance (100 agents, each with its own set of 10 one-armed bandits, tested over 2000 steps), as the lower plot shows.
I've seen this post, and its code seems effectively equivalent to mine once the error that prompted that post is fixed. But unlike my code, that post's code works properly when rectified!
I can't figure out where I've made a mistake. Can you help me use the full power of the gradient-bandit agent properly?
Solved! I am sorry, the problem was outside of the code above: the check for whether the best bandit was chosen was coded incorrectly!
An interesting thing: np.random.choice(a_list) returns a NumPy scalar (e.g. numpy.int64) rather than a plain Python value. And when you compare that variable to another_list, NumPy broadcasts it and compares the two as array-likes, producing a boolean array!
That was something I didn't know about / hadn't paid attention to, which kept the actual error in my code hidden from me.
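A minimal snippet reproducing the gotcha (the names here are made up for illustration, they are not from my actual code):

import numpy as np

best_bandits = [3, 7]          # indices considered "best"
chosen = np.random.choice(10)  # a NumPy scalar, e.g. numpy.int64

print(type(chosen))            # <class 'numpy.int64'>
print(chosen == best_bandits)  # an array like [False False], not a single bool!

# Using that comparison in an "if" either silently evaluates the truthiness of a
# 1-element array or raises "The truth value of an array with more than one element
# is ambiguous". A membership test gives the plain bool I actually wanted:
print(chosen in best_bandits)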