pythonnumpyrandomsizemultinomial

problem with np.random.multinomial and size option in python


My problem is the following: I have N people choosing between three objects [1,2,3] with probabilities [p_1,p_2,p_3] such that p_1+p_2+p_3=1. Let's call X_1,X_2,X_3 the counts of the chosen objects in one sample among the N people (then, for example, X_1 is the number of people choosing object 1 ).

The vector X_1,X_2,X_3 follows a multinomial distribution as in Wikipedia. It is well known that cov(X_1,X_2) (covariance between X_1,X_2)=-N*p_1*p_2.

I want to verify this covariance formula. I did two experiments and I got different results. I cannot understand why.

Attempt A I coded (with p_1=0.4,p_2=0.2,p_3=0.4 and N=50):

q=np.random.multinomial(50, [0.4,0.2,0.4],size=1000)
df=pd.DataFrame(q,columns=["X_1","X_2","X_3"])
cov_matrix=np.cov([df["X_1"],df["X_2"],df["X_3"]])

In my specific case, I got cov(X_1,X_2)=-4.44586486 : it is very similar to what I was expecting as -N*p_1*p_2=-50*0.4*0.2=-4

Attempt B (where I sequentially create samples of multinomial draws) I coded:

s=[1]*1000 # 1000 as the size
df["constant"]=s
df["X_1"]= df.apply(lambda x: np.random.multinomial(50, [0.4,0.2,0.4])[0],axis=1)
df["X_2"]= df.apply(lambda x: np.random.multinomial(50, [0.4,0.2,0.4])[1],axis=1)
df["X_3"]= df.apply(lambda x: np.random.multinomial(50, [0.4,0.2,0.4])[2],axis=1)
cov_matrix=np.cov([df["X_1"],df["X_2"],df["X_3"]])

In my specific case, I got cov(X_1,X_2)=-0.087452 : it is very different than what I was expecting (that is 4).

It seems to me the only difference between A and B is that in A size=1000, whereas in B I am creating a draw for each row of my dataframe.

Why do I get different results? Which mistakes I am making? There was a similar question here, but answers are not very helpful.


Solution

  • In the second example, each column is generated independent of the other columns, so each row is no longer a sample from a single multinomial distribution. (Note, for example, that the sums of rows are not 50.) In effect, each column is a binomial distribution that is independent of the other columns.

    The second example is equivalent to this (note that I am using the newer random API):

    In [142]: rng = np.random.default_rng()
    
    In [143]: p = rng.binomial(50, [0.4, 0.2, 0.4], size=(1000, 3))
    
    In [144]: p
    Out[144]: 
    array([[15, 12, 20],
           [22,  6, 22],
           [17, 12, 17],
           ...,
           [20, 13, 17],
           [27,  9, 24],
           [25,  9, 22]])
    
    In [145]: np.cov(p.T)
    Out[145]: 
    array([[13.09065465,  0.15038238, -0.22775375],
           [ 0.15038238,  7.85507107, -0.0142022 ],
           [-0.22775375, -0.0142022 , 11.87303203]])