I am implementing a Softmax Action Selection policy for a reinforcement learning task (http://www.incompleteideas.net/book/ebook/node17.html).
I came up with this solution, but I think there is room for improvement.
1 - Here I compute the action probabilities:
from math import exp

prob_t = [0.0] * nActions
denominator = 0.0
for a in range(nActions):
    denominator += exp(Q[state][a] / temperature)
for a in range(nActions):
    prob_t[a] = exp(Q[state][a] / temperature) / denominator
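As a side note, if the Q values ever get large relative to the temperature, exp() can overflow. A standard softmax stabilisation trick (not something the linked chapter requires, just a sketch using the same names Q, state, temperature and nActions) is to shift by the largest preference before exponentiating; the shift cancels out after normalisation:

from math import exp

# Sketch: compute each numerator once, shifted by the largest preference
# so exp() cannot overflow; the shift cancels after normalisation.
max_q = max(Q[state][a] for a in range(nActions))
numerators = [exp((Q[state][a] - max_q) / temperature) for a in range(nActions)]
denominator = sum(numerators)
prob_t = [n / denominator for n in numerators]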
2 - Here I compare a randomly generated number in [0, 1) to the cumulative probabilities of the actions:
import random

rand_action = random.random()
if rand_action < prob_t[0]:
    action = 0
elif rand_action < prob_t[0] + prob_t[1]:
    action = 1
else:  # rand_action >= prob_t[0] + prob_t[1]
    action = 2
Edit:
Example: rand_action is 0.78, prob_t[0] is 0.25, prob_t[1] is 0.35 and prob_t[2] is 0.4, so the probabilities sum to 1. Since 0.78 is greater than the combined probability of actions 0 and 1 (prob_t[0] + prob_t[1] = 0.6), action 2 is picked.
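The if/elif chain above only handles exactly three actions. A sketch of the same cumulative-probability idea that generalises to any nActions (the accumulation loop and the round-off fallback are my own additions, not part of the original code):

import random

# Sketch: walk the cumulative distribution until it exceeds the random draw.
rand_action = random.random()
cumulative = 0.0
action = nActions - 1  # fallback in case floating-point round-off leaves a tiny gap
for a in range(nActions):
    cumulative += prob_t[a]
    if rand_action < cumulative:
        action = a
        break

With NumPy available, np.random.choice(nActions, p=prob_t) performs the same weighted sampling in a single call.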
Is there a more efficient way of doing this?
After the suggestions to use NumPy I did a bit of research and came up with this solution for the first part of the softmax implementation:
import numpy as np

prob_t = [0.0] * nActions  # initialise
for a in range(nActions):
    prob_t[a] = np.exp(Q[state][a] / temperature)  # calculate the numerators
# element-wise division by the denominator (the sum of the numerators)
prob_t = np.true_divide(prob_t, sum(prob_t))
That's one for loop fewer than my initial solution. The only downside I can see is a reduced precision.
Using numpy:
[ 0.02645082 0.02645082 0.94709836]
Initial two-loop solution:
[0.02645082063629476, 0.02645082063629476, 0.9470983587274104]
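The remaining loop over actions could also be pushed into NumPy. A minimal sketch, assuming Q[state] can be viewed as a sequence of nActions floats:

import numpy as np

# Sketch: vectorised softmax over the action preferences for the current state.
preferences = np.asarray(Q[state], dtype=float)
prob_t = np.exp(preferences / temperature)   # all numerators in one call
prob_t /= prob_t.sum()                       # normalise to a probability vector

With arrays, the plain / operator already performs element-wise division, so np.true_divide is not needed here.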