How to use crosstab (Matlab) to get the correct p-value

Intro. With Matlab, I am running a chi-square test on contingency tables, with two sets of observed counts, cx and cy, by using crosstab.

I have noticed that the p-value varies if, in the calculation of crosstab with crosstab(cx,cx) or crosstab(cx,cy), we consider the entire sets of observed counts, cx and cy, or part of them, i.e. cx(1:nbc) and cy(1:nbc), with nbc being the number of bins we want to consider. Please see the following code and resulting plot for a better understanding:

% Input numbers, where "nb" is the number of bins, 
% in which the data of arrays "x" and "y" will be placed
rng default;  % for reproducibility
a = 0;
b = 100;
nb = 50;

% Create two log-normal distributed random datasets, "x" and "y"
% (but we can use any randomly distributed data)
x = (b-a).*round(lognrnd(1,1,1000,1)) + a;
y = (b-a).*round(lognrnd(0.88,1.1,1000,1)) + a;

% Counts/frequency of "x" and "y"
cx = histcounts(x,'NumBins',nb);
cy = histcounts(y,'NumBins',nb);

% Instead of comparing the entire binned data, "cx" and "cy", with each other,
% we can compare parts of them, i.e. "cx(1:nbc)" and "cy(1:nbc)", with "nbc" being
% the "number of bins compared". For each "nbc" we compute the "p-value" from
% "crosstab(cx(1:nbc),cx(1:nbc))" and from "crosstab(cx(1:nbc),cy(1:nbc))".
% Therefore, by increasing the "number of bins compared", we see how the
% "p-value" changes, for both "crosstab(cx,cx)" and "crosstab(cx,cy)"
% comparisons.
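% Preallocate the result arrays: one row [nbc, p-value] per value of nbc
A = zeros(nb-3,2);
B = zeros(nb-3,2);
i = 1;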
for nbc = 4 : nb
    [~,~,pxx] = crosstab(cx(1:nbc),cx(1:nbc));
    [~,~,pxy] = crosstab(cx(1:nbc),cy(1:nbc));
    A(i,:) = [nbc,pxx];
    B(i,:) = [nbc,pxy];
    i = i + 1;
end

% Plots
hold on
pA = plot(A(:,1),A(:,2),'o-b');
pB = plot(B(:,1),B(:,2),'o-r');
ylabel('p-value')
xlabel('number of bins compared (nbc)')
legend([pA pB],{'crosstab (cx,cx)','crosstab (cx,cy)'})

The resulting plot of "number of bins compared" (nbc) versus the p-value calculated either with crosstab(cx,cy) or with crosstab(cx,cx) is the following:

For both calculations of the chi-square test on contingency tables, either comparing the two different binned datasets/distributions with crosstab(cx,cy), or comparing one binned dataset/distribution with itself with crosstab(cx,cx), the p-value always goes to zero as we increase the number of bins nbc (in this case from 4 to 50) that we want to compare. The p-value is different from zero only if we compare (relatively small) parts of the cx and/or cy datasets/distributions.

Two questions.

1 - Therefore, should crosstab be used only with some parts of the cx and cy datasets/distributions, i.e. with cx(1:nbc) and cy(1:nbc)? Otherwise, for relatively large datasets/distributions the p-value will be zero or close to zero.

2 - Shouldn't we get p-value=1 if we compare a binned dataset/distribution with itself, i.e. by crosstab(cx,cx)? Indeed, the null hypothesis should be that two samples come from the same distribution, which is somehow obvious if I compare a distribution cx with itself. But then, if we consider crosstab(cx,cx) with the entire distribution cx, we see that the p-value goes to zero. A non-zero p-value comes if we consider parts of the cx distributions, i.e. by using crosstab(cx(1:nbc),cx(1:nbc)).

Solution

Understanding crosstab and p-values

When you call crosstab(cx, cy), the intent is to compare the observed frequencies in the bins of x and y against each other.

A chi-square test of independence is performed to determine if there is a significant association between the frequencies of x and y across the bins.
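
To see what crosstab actually tabulates, here is a minimal sketch with made-up data (the vector g is hypothetical, not taken from the question). crosstab treats every element of its inputs as the category label of one observation, so passing the same vector twice gives a purely diagonal table:

```matlab
% Hypothetical label vector: 16 observations, 4 distinct values
g = [5 5 5 5 3 3 3 3 1 1 1 1 0 0 0 0];

% crosstab counts how often each pair of labels (g(i), g(i)) occurs
[tbl, chi2, p] = crosstab(g, g);

disp(tbl)                                      % 4-by-4 table, non-zero only on the diagonal
fprintf('chi2 = %.4g, p = %.3g\n', chi2, p)    % large statistic, small p-value
```

For a diagonal table with n observations and k distinct labels the statistic works out to n(k-1) on (k-1)^2 degrees of freedom, so many repeated values in the input push the p-value towards zero.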

As nbc increases, the contingency table becomes sparse: many cells have very small expected counts. The chi-square approximation is unreliable in that regime and the test can react strongly to small deviations, which leads to low p-values.

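Additionally, comparing cx with itself should produce a chi-square statistic equal to 0, because the two sets of counts are identical and cannot differ. That holds for a test run on the 2-by-nbc table of counts [cx; cx]. It does not hold for crosstab(cx, cx), because crosstab treats each element of its inputs as the category label of one observation rather than as a count. Identical inputs then give a table whose only non-zero entries lie on the diagonal, which the test reads as strong association, so the p-value can be very small instead of 1.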
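
A sketch of the comparison on the counts themselves, assuming the goal is a chi-square test of homogeneity on the 2-by-k table of counts. The toy counts c1 and c2 are hypothetical; in the question they would be cx and cy computed with the same bin edges. The statistic is computed by hand, and the same table is then rebuilt through crosstab by expanding the counts into one label per observation with repelem:

```matlab
% Hypothetical binned counts; c2 = c1 mimics comparing cx with itself
c1 = [30 25 20 15 10];
c2 = c1;

% 2-by-k table of observed counts, dropping bins that are empty in both samples
obs = [c1; c2];
obs = obs(:, sum(obs,1) > 0);

% Expected counts under homogeneity: (row total * column total) / grand total
expd = sum(obs,2) * sum(obs,1) / sum(obs(:));
chi2 = sum((obs(:) - expd(:)).^2 ./ expd(:));
df   = (size(obs,1)-1) * (size(obs,2)-1);
p    = chi2cdf(chi2, df, 'upper');

% Same test through crosstab: one sample label and one bin label per observation
k = numel(c1);
binIdx = [repelem(1:k, c1), repelem(1:k, c2)];
smpIdx = [ones(1,sum(c1)), 2*ones(1,sum(c2))];
[tbl, chi2ct, pct] = crosstab(smpIdx, binIdx);

fprintf('by hand:  chi2 = %g, p = %g\n', chi2, p)
fprintf('crosstab: chi2 = %g, p = %g\n', chi2ct, pct)
```

With identical counts the observed and expected tables coincide, so both routes give a statistic of 0 and a p-value of 1.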

Answer

It's not strictly necessary to use only parts of cx and cy, but doing so makes the table less sparse and generally gives more reliable p-values.

As for whether comparing cx with itself should yield a p-value of 1: in short, yes, provided the test is run on the counts themselves.

What I would recommend is to:

Check whether the expected frequencies are too low. If they are, combine bins to increase the counts in each bin.
Test different bin counts, and make sure that the bins are well populated.
Use a different testing method if the chi-square test remains problematic, for example an exact test such as Fisher's exact test.
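
The first two recommendations can be sketched as follows. This is a minimal illustration, assuming that the sparse bins sit in the right-hand tail (as for log-normal data) and using the common rule of thumb that every expected count should be at least 5. The counts c1 and c2 and the threshold minExpected are hypothetical.

```matlab
% Hypothetical binned counts with a sparse right-hand tail
c1 = [40 30 15 8 4 2 1 0];
c2 = [35 28 20 9 5 2 0 1];
obs = [c1; c2];
minExpected = 5;   % rule-of-thumb threshold (an assumption, adjust as needed)

% Pool the last two bins until every expected count is large enough
while size(obs,2) > 2
    expd = sum(obs,2) * sum(obs,1) / sum(obs(:));
    if all(expd(:) >= minExpected)
        break
    end
    obs(:,end-1) = obs(:,end-1) + obs(:,end);
    obs(:,end) = [];
end

% Chi-square test on the pooled 2-by-k table
expd = sum(obs,2) * sum(obs,1) / sum(obs(:));
chi2 = sum((obs(:) - expd(:)).^2 ./ expd(:));
df   = (size(obs,1)-1) * (size(obs,2)-1);
p    = chi2cdf(chi2, df, 'upper');

disp(obs)                                             % pooled table of counts
fprintf('chi2 = %.4g, df = %d, p = %.3g\n', chi2, df, p)
```

Pooling only from the right is a deliberate simplification. If sparse bins occur elsewhere, merging each one into its nearest neighbour would be the analogous step.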