I want to speed up the code below - namely the for loop. Is there a way to do it in numpy?
import numpy as np
# define seend and random state
rng = np.random.default_rng(0)
# num of throws
N = 10**1
# max number of trials
total_n_trials = 10
# generate the throws' distributions of "performace" - overall scores
# mu_throws = rng.normal(0.7, 0.1, N)
mu_throws = np.linspace(0,1,N)
all_trials = np.zeros(N*total_n_trials)
for i in range(N):
# generate trials per throw as Bernoulli trials with given mean
all_trials[i*total_n_trials:(i+1)*total_n_trials] = rng.choice([0,1], size=total_n_trials, p=[1-mu_throws[i],mu_throws[i]])
More explanations - I want to generate N
sequences of Bernoulli trials (ie. 0s and 1s, called throws
) where each sequence has a mean (probability p
) given by values in another array (mu_throws
). This can be sampled from a normal distribution or in this case, I took it for simplicity to be a sequence of N=10
numbers from 0 to 1. The approach above works but is slow. I expect N
to be at least $10^4$ and total_n_trials
then can be anything in the order of hundreds to (tens of) thousands (and this is run several times). I have checked the following post but didn't find an answer. I also know that numpy random choice
can generate multidimensional arrays but I didn't find a way to set a different set of p
s for different rows. Basically getting the same as what I do above just reshaped:
all_trials.reshape(N,-1)
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 1., 0., 0.],
[1., 0., 0., 1., 0., 0., 0., 1., 1., 0.],
[1., 0., 1., 0., 0., 1., 0., 1., 0., 1.],
[1., 0., 1., 0., 0., 0., 1., 1., 0., 0.],
[1., 0., 0., 1., 0., 1., 0., 1., 1., 0.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
I also thought about this trick but haven't figured how to use it for Bernoulli trials. Thanks for any tips.
N = 11
mu_throws = np.linspace(0,1,N)
total_n_trials = 10_000
rng = np.random.default_rng(0)
all_trials = (rng.random((N, total_n_trials)).T<mu_throws).T.astype(int)
all_trials # shape (N, total_n_trials)
output:
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[1, 0, 0, ..., 0, 0, 0],
...,
[1, 0, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 0, 1, 1],
[1, 1, 1, ..., 1, 1, 1]])
Basically, what I'm doing is generate random real numbers in the interval [0, 1)
and convert them in boolean results in function of a given probability (in mu_throws
).
if you compare the mu_throws
(actual probability mean) and the probability estimated in all_trials
you have:
np.c_[mu_throws, all_trials.mean(1)]
output:
array([[0. , 0. ],
[0.1 , 0.1003],
[0.2 , 0.1963],
[0.3 , 0.305 ],
[0.4 , 0.4006],
[0.5 , 0.5056],
[0.6 , 0.5992],
[0.7 , 0.6962],
[0.8 , 0.7906],
[0.9 , 0.8953],
[1. , 1. ]])
For N
and total_n_trials
values from your example the time required on my machine is 0.00014019012451171875
sec, vs 0.0012738704681396484
sec of your loop.