Intro. With MATLAB, I am running a chi-square test on contingency tables, with two sets of observed counts, cx and cy, by using crosstab.
I have noticed that the p-value varies depending on whether, in calling crosstab(cx,cx) or crosstab(cx,cy), we use the entire sets of observed counts, cx and cy, or only part of them, i.e. cx(1:nbc) and cy(1:nbc), with nbc being the number of bins we want to consider. Please see the following code and resulting plot for a better understanding:
% Input numbers, where "nb" is the number of bins,
% in which the data of arrays "x" and "y" will be placed
rng default; % for reproducibility
a = 0;
b = 100;
nb = 50;
% Create two log-normal distributed random datasets, "x" and "y"
% (but we can use any randomly distributed data)
x = (b-a).*round(lognrnd(1,1,1000,1)) + a;
y = (b-a).*round(lognrnd(0.88,1.1,1000,1)) + a;
% Counts/frequency of "x" and "y"
cx = histcounts(x,'NumBins',nb);
cy = histcounts(y,'NumBins',nb);
% Instead of comparing the entire binned data, "cx" and "cy", with each other,
% we can compare parts of them, i.e. "cx(1:nbc)" and "cy(1:nbc)", with "nbc" being
% the "number of bins compared". We then plot "nbc" against the p-value
% returned by "crosstab(cx(1:nbc),cx(1:nbc))" and by "crosstab(cx(1:nbc),cy(1:nbc))".
% By increasing the "number of bins compared", we see how the p-value
% changes for both the "crosstab(cx,cx)" and "crosstab(cx,cy)" comparisons.
A = zeros(nb-3,2); % preallocate: one row per nbc value, columns [nbc, p-value]
B = zeros(nb-3,2);
i = 1;
for nbc = 4 : nb
    [~,~,pxx] = crosstab(cx(1:nbc),cx(1:nbc));
    [~,~,pxy] = crosstab(cx(1:nbc),cy(1:nbc));
    A(i,:) = [nbc,pxx];
    B(i,:) = [nbc,pxy];
    i = i + 1;
end
% Plots
hold on
pA = plot(A(:,1),A(:,2),'o-b');
pB = plot(B(:,1),B(:,2),'o-r');
ylabel('p-value')
xlabel('number of bins compared (nbc)')
legend([pA pB],{'crosstab (cx,cx)','crosstab (cx,cy)'})
The resulting plot of "number of bins compared" (nbc) versus the p-value calculated either with crosstab(cx,cy) or with crosstab(cx,cx) is the following:
For both calculations of the chi-square test on contingency tables, either comparing the two different binned datasets/distributions with crosstab(cx,cy), or comparing one binned dataset/distribution with itself with crosstab(cx,cx), the p-value always goes to zero as we increase the number of bins nbc (in this case from 4 to 50) that we want to compare. The p-value is different from zero only if we compare (relatively small) parts of the cx and/or cy datasets/distributions.
Two questions.
1 - Therefore, should crosstab be used only with some parts of the cx and cy datasets/distributions, i.e. with cx(1:nbc) and cy(1:nbc)? Otherwise, for relatively large datasets/distributions the p-value will be zero or close to zero.
2 - Shouldn't we get p-value = 1 if we compare a binned dataset/distribution with itself, i.e. with crosstab(cx,cx)? Indeed, the null hypothesis should be that the two samples come from the same distribution, which is obviously true if I compare a distribution cx with itself.
But if we consider crosstab(cx,cx) with the entire distribution cx, we see that the p-value goes to zero. A non-zero p-value appears only if we consider parts of the cx distribution, i.e. by using crosstab(cx(1:nbc),cx(1:nbc)).
Understanding crosstab and p-values
When using crosstab(cx, cy), you're comparing the observed frequencies in the bins of x and y against each other.
A chi-square test of independence is performed to determine whether there is a significant association between the frequencies of x and y across the bins.
When you increase the number of bins nbc, the contingency table becomes sparse, and the chi-square test may become sensitive to small deviations, which leads to low p-values.
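To make the sparsity point concrete, here is a minimal sketch that builds a 2-by-k table from two count vectors, computes the expected counts under the null hypothesis, and counts how many cells fall below 5. The count vectors are made-up illustrative values (not the data from the question), and the threshold of 5 is a common rule of thumb rather than anything crosstab enforces.

```matlab
% Illustrative bin counts (assumed values, not the question's data)
cx = [400 300 150 80 40 15 8 4 2 1];
cy = [380 290 170 90 35 20 9 3 2 1];

T = [cx; cy];                          % 2-by-k table: one row per sample
E = sum(T,2) * sum(T,1) / sum(T(:));   % expected counts under the null

minExpected = 5;                       % rule-of-thumb threshold (assumption)
nSparse = nnz(E < minExpected);        % number of cells with small expected count
fprintf('%d of %d cells have an expected count below %d\n', ...
        nSparse, numel(E), minExpected);
```

If many cells fall below the threshold, merging the tail bins before testing is one possible way to make the table less sparse.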
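Additionally, comparing cx to itself should, in principle, produce a chi-square statistic equal to 0, because you are comparing a distribution to itself and the observed and expected counts should not differ. That won't happen with crosstab(cx, cx) because of how crosstab handles data: it treats each input as a grouping variable with one element per observation, not as a vector of pre-binned counts, so it cross-tabulates the count values themselves.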
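How crosstab handles its inputs can be checked on a tiny example. The sketch below uses a made-up count vector (an assumption for illustration) and shows that crosstab(cx, cx) builds a table with one row and column per distinct count value, whose total is the number of bins rather than the number of observations. It requires the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical bin counts, chosen only for illustration
cx = [5 3 3 1];

% crosstab treats each element as one observation's group label,
% so the groups here are the distinct values 1, 3 and 5
[tbl, chi2, p] = crosstab(cx, cx);

disp(tbl)                  % diagonal table: every label pairs only with itself
disp(sum(tbl(:)))          % equals numel(cx), not sum(cx)
```

A diagonal table is what perfect association looks like, which is why the p-value shrinks as more bins (that is, more "observations") are added.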
Answer
It's not strictly necessary to use only parts of cx and cy, but doing so makes the table less sparse and generally gives more reliable p-values.
As for whether crosstab(cx, cx) should yield a p-value of 1: short answer, yes.
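One way to obtain that p-value of 1, sketched here under assumptions, is to rebuild per-observation labels from the counts and pass those to crosstab: one vector saying which sample each observation belongs to, and one saying which bin it fell into. The count vector and the variable names binIdx and sampleId are hypothetical, introduced only for this illustration.

```matlab
% Hypothetical bin counts; cy is set equal to cx to mimic a self-comparison
cx = [12 30 25 8];
cy = cx;
k  = numel(cx);

% Expand counts into one label per observation
binIdx   = [repelem(1:k, cx), repelem(1:k, cy)];      % bin of each observation
sampleId = [ones(1, sum(cx)), 2*ones(1, sum(cy))];    % 1 = first sample, 2 = second

% Rows of tbl are the two samples, columns are the bins
[tbl, chi2, p] = crosstab(sampleId, binIdx);
fprintf('chi2 = %g, p = %g\n', chi2, p);
```

Because the two rows of the table are identical, the observed counts match the expected counts and the statistic is zero. Bins that are empty in both samples contribute no labels, so they drop out of the table on their own.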
What I would recommend is to: