Tags: python-2.7, if-statement, random, reinforcement-learning, softmax

Is there a better way than this to implement Softmax Action Selection for Reinforcement Learning?


I am implementing a softmax action selection policy for a reinforcement learning task (http://www.incompleteideas.net/book/ebook/node17.html).
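
For reference, the linked page defines the softmax (Boltzmann) action probabilities as

    $$P_t(a) = \frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}$$

where $\tau$ is the temperature: high temperatures make all actions nearly equiprobable, while low temperatures favor the greedy action.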

I came up with this solution, but I think there is room for improvement.

1 - Here I compute the probabilities:

    from math import exp

    prob_t = [0] * nActions          # one probability per action
    denominator = 0
    for a in range(nActions):
        denominator += exp(Q[state][a] / temperature)

    for a in range(nActions):
        prob_t[a] = exp(Q[state][a] / temperature) / denominator
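
One caveat worth noting (an addition of mine, not part of the original code): exp(Q[state][a] / temperature) can overflow for large Q-values or small temperatures. Subtracting the largest exponent first yields exactly the same probabilities but is numerically safer; a minimal sketch:

    from math import exp

    def softmax_probs(q_values, temperature):
        """Numerically stable softmax: shifting every exponent by the
        maximum leaves the ratios (and thus the probabilities) unchanged."""
        exponents = [q / temperature for q in q_values]
        m = max(exponents)
        numerators = [exp(e - m) for e in exponents]
        denominator = sum(numerators)
        return [n / denominator for n in numerators]

    # usage: prob_t = softmax_probs(Q[state], temperature)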

2 - Here I compare a randomly generated number in the range [0, 1) with the cumulative probabilities of the actions:

    import random

    rand_action = random.random()
    if rand_action < prob_t[0]:
        action = 0
    elif rand_action < prob_t[0] + prob_t[1]:  # the first test failed, so rand_action >= prob_t[0]
        action = 1
    else:  # rand_action >= prob_t[0] + prob_t[1]
        action = 2

Edit:

Example: rand_action is 0.78, prob_t[0] is 0.25, prob_t[1] is 0.35 and prob_t[2] is 0.4, so the probabilities sum to 1. Since 0.78 is greater than prob_t[0] + prob_t[1] = 0.60, action 2 is picked.
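
For completeness, here is a sketch of how the same cumulative comparison generalizes to any number of actions (the function name is mine, not from the original code); with numpy, np.random.choice(nActions, p=prob_t) does the same job in one call:

    import random

    def select_action(prob_t):
        """Walk the cumulative distribution until it exceeds a uniform draw."""
        r = random.random()
        cumulative = 0.0
        for a, p in enumerate(prob_t):
            cumulative += p
            if r < cumulative:
                return a
        return len(prob_t) - 1  # guard against floating-point rounding of the sum

    # usage: action = select_action(prob_t)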

Is there a more efficient way of doing this?


Solution

  • After the suggestions to use numpy I did a bit of research and came up with this solution for the first part of the softmax implementation.

    import numpy as np

    prob_t = [0] * nActions                        # initialise
    for a in range(nActions):
        prob_t[a] = np.exp(Q[state][a] / temperature)   # calculate numerators

    # element-wise division by the denominator (the sum of the numerators)
    prob_t = np.true_divide(prob_t, sum(prob_t))
    

    That's one for loop fewer than in my initial solution. The only downside I could see is the apparently reduced precision, but that turns out to be just numpy's print formatting: the values are the same float64 numbers underneath, and np.set_printoptions(precision=17) would display all the digits.

    using numpy:

    [ 0.02645082  0.02645082  0.94709836]
    

    initial two-loop solution:

    [0.02645082063629476, 0.02645082063629476, 0.9470983587274104]
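
    For what it's worth, the remaining loop can be removed too by vectorizing over the whole row of Q; a minimal sketch, assuming Q[state] can be converted to a numpy array (the function name is mine):

        import numpy as np

        def softmax_probs_np(q_row, temperature):
            """Fully vectorized softmax over one row of Q."""
            z = np.asarray(q_row, dtype=float) / temperature
            z -= z.max()                   # optional shift for numerical stability
            numerators = np.exp(z)
            return numerators / numerators.sum()

        # usage: prob_t = softmax_probs_np(Q[state], temperature)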