My problem is the following:
I have N
people choosing between three objects [1,2,3]
with probabilities [p_1,p_2,p_3]
such that p_1+p_2+p_3=1
. Let's call X_1,X_2,X_3
the counts of the chosen objects in one sample among the N
people (then, for example, X_1 is the number of people choosing object 1 ).
The vector X_1,X_2,X_3
follows a multinomial distribution as in Wikipedia.
It is well known that cov(X_1,X_2)
(covariance between X_1,X_2)=-N*p_1*p_2
.
I want to verify this covariance formula. I did two experiments and I got different results. I cannot understand why.
Attempt A
I coded (with p_1=0.4,p_2=0.2,p_3=0.4
and N=50
):
q=np.random.multinomial(50, [0.4,0.2,0.4],size=1000)
df=pd.DataFrame(q,columns=["X_1","X_2","X_3"])
cov_matrix=np.cov([df["X_1"],df["X_2"],df["X_3"]])
In my specific case, I got cov(X_1,X_2)=-4.44586486
: it is very similar to what I was expecting as -N*p_1*p_2=-50*0.4*0.2=-4
Attempt B (where I sequentially create samples of multinomial draws) I coded:
s=[1]*1000 # 1000 as the size
df["constant"]=s
df["X_1"]= df.apply(lambda x: np.random.multinomial(50, [0.4,0.2,0.4])[0],axis=1)
df["X_2"]= df.apply(lambda x: np.random.multinomial(50, [0.4,0.2,0.4])[1],axis=1)
df["X_3"]= df.apply(lambda x: np.random.multinomial(50, [0.4,0.2,0.4])[2],axis=1)
cov_matrix=np.cov([df["X_1"],df["X_2"],df["X_3"]])
In my specific case, I got cov(X_1,X_2)=-0.087452
: it is very different than what I was expecting (that is 4).
It seems to me the only difference between A and B is that in A size=1000, whereas in B I am creating a draw for each row of my dataframe.
Why do I get different results? Which mistakes I am making? There was a similar question here, but answers are not very helpful.
In the second example, each column is generated independent of the other columns, so each row is no longer a sample from a single multinomial distribution. (Note, for example, that the sums of rows are not 50.) In effect, each column is a binomial distribution that is independent of the other columns.
The second example is equivalent to this (note that I am using the newer random API):
In [142]: rng = np.random.default_rng()
In [143]: p = rng.binomial(50, [0.4, 0.2, 0.4], size=(1000, 3))
In [144]: p
Out[144]:
array([[15, 12, 20],
[22, 6, 22],
[17, 12, 17],
...,
[20, 13, 17],
[27, 9, 24],
[25, 9, 22]])
In [145]: np.cov(p.T)
Out[145]:
array([[13.09065465, 0.15038238, -0.22775375],
[ 0.15038238, 7.85507107, -0.0142022 ],
[-0.22775375, -0.0142022 , 11.87303203]])