Two binary variables (x and y) form two columns, indexed by date, in a pandas DataFrame. I want to calculate a correlation score between x and y that quantifies how strongly x=1 goes with y=1 (and x=0 with y=0).
What definition of correlation is appropriate?
Is there a built-in function?
| day | x | y |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 1 | 0 |
| 2 | 0 | 0 |
| 3 | 1 | 1 |
Explanation: These are two categorical variables. Say x = had eggs for breakfast (0 or 1) and y = got a headache (0 or 1), and there is data from several days for both x and y. I'm trying to see how 'strongly correlated' having eggs and having a headache are. My understanding is that Pearson's correlation is not applicable here. What could be used?
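The correlation metric to use in this case is the phi coefficient. Defined for two binary variables, it is also known as the Matthews correlation coefficient, and it is numerically equal to Pearson's correlation coefficient computed on the 0/1 values, so Pearson's formula does apply here.
phi = (n11*n00 - n10*n01) / sqrt((n11+n10) * (n01+n00) * (n11+n01) * (n10+n00))
where
n11 (n00) = number of rows with x=1 (0) and y=1 (0), n10 = number of rows with x=1 and y=0, and n01 = number of rows with x=0 and y=1. The four factors under the square root are the row and column totals of the 2x2 table: the number of rows with x=1, x=0, y=1 and y=0 respectively.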
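
As for a built-in: a minimal sketch, assuming the data sits in a DataFrame `df` with 0/1 columns named `x` and `y` as in the question. The helper `phi_coefficient` is a hypothetical name written for this illustration, not a pandas function. It counts the four cells of the 2x2 table, divides by the square root of the product of the row and column totals, and is then compared with pandas' own `Series.corr`, which computes Pearson's correlation by default.

```python
import math

import pandas as pd

# The four example days from the question.
df = pd.DataFrame({"x": [1, 1, 0, 1], "y": [1, 0, 0, 1]})


def phi_coefficient(x, y):
    """Phi coefficient of two 0/1 sequences (hypothetical helper)."""
    x = pd.Series(x).astype(bool)
    y = pd.Series(y).astype(bool)

    # Cell counts of the 2x2 contingency table.
    n11 = int((x & y).sum())    # x=1, y=1
    n10 = int((x & ~y).sum())   # x=1, y=0
    n01 = int((~x & y).sum())   # x=0, y=1
    n00 = int((~x & ~y).sum())  # x=0, y=0

    # Product of the row totals (x=1, x=0) and column totals (y=1, y=0).
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # One of the variables is constant, so the coefficient is undefined.
        return float("nan")
    return (n11 * n00 - n10 * n01) / denom


phi = phi_coefficient(df["x"], df["y"])

# Built-in route: Pearson's correlation on the 0/1 columns.
pearson = df["x"].corr(df["y"])

print(phi, pearson)
```

Both values should agree up to floating-point error, so in practice `df["x"].corr(df["y"])` (or `df.corr()` for the whole table) is likely all that is needed. If scikit-learn is available, `sklearn.metrics.matthews_corrcoef` computes the same quantity.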