# How to use crosstab (Matlab) to get the correct p-value

**Intro.** With Matlab, I am running a chi-square test on contingency tables, with two sets of observed counts, `cx` and `cy`, by using `crosstab`.

I have noticed that the p-value varies depending on whether, in the calculation with `crosstab(cx,cx)` or `crosstab(cx,cy)`, we consider the entire sets of observed counts, `cx` and `cy`, or only part of them, i.e. `cx(1:nbc)` and `cy(1:nbc)`, with `nbc` being the number of bins we want to consider. Please see the following code and resulting plot for a better understanding:

```
% Input numbers, where "nb" is the number of bins,
% in which the data of arrays "x" and "y" will be placed
rng default; % for reproducibility
a = 0;
b = 100;
nb = 50;
% Create two log-normal distributed random datasets, "x" and "y"
% (but we can use any randomly distributed data)
x = (b-a).*round(lognrnd(1,1,1000,1)) + a;
y = (b-a).*round(lognrnd(0.88,1.1,1000,1)) + a;
% Counts/frequency of "x" and "y"
cx = histcounts(x,'NumBins',nb);
cy = histcounts(y,'NumBins',nb);
% Instead of comparing the entire binned data, "cx" and "cy", with each other,
% we can compare parts of them, i.e. "cx(1:nbc)" and "cy(1:nbc)", with "nbc" being
% the "number of bins compared". We then look at how the p-value returned by
% "crosstab(cx(1:nbc),cx(1:nbc))" and by "crosstab(cx(1:nbc),cy(1:nbc))"
% changes as "nbc" increases.
A = zeros(nb-3,2); % preallocate: one row [nbc, p-value] per iteration
B = zeros(nb-3,2);
i = 1;
for nbc = 4 : nb
    [~,~,pxx] = crosstab(cx(1:nbc),cx(1:nbc));
    [~,~,pxy] = crosstab(cx(1:nbc),cy(1:nbc));
    A(i,:) = [nbc,pxx];
    B(i,:) = [nbc,pxy];
    i = i + 1;
end
% Plots
hold on
pA = plot(A(:,1),A(:,2),'o-b');
pB = plot(B(:,1),B(:,2),'o-r');
ylabel('p-value')
xlabel('number of bins compared (nbc)')
legend([pA pB],{'crosstab (cx,cx)','crosstab (cx,cy)'})
```

The resulting plot of "number of bins compared" (`nbc`) versus the p-value calculated either with `crosstab(cx,cy)` or with `crosstab(cx,cx)` is the following:

For both calculations of the chi-square test on contingency tables, either comparing the two different binned datasets/distributions with `crosstab(cx,cy)`, or comparing one binned dataset/distribution with itself with `crosstab(cx,cx)`, the p-value always goes to zero as we increase the number of bins `nbc` (in this case from 4 to 50) that we want to compare. The p-value is different from zero only if we compare (relatively small) parts of the `cx` and/or `cy` datasets/distributions.

**Two questions.**

1 - Therefore, should `crosstab` be used only with some parts of the `cx` and `cy` datasets/distributions, i.e. with `cx(1:nbc)` and `cy(1:nbc)`? Otherwise, for relatively large datasets/distributions the p-value will be zero or close to zero.

2 - Shouldn't we get a p-value of 1 if we compare a binned dataset/distribution with itself, i.e. with `crosstab(cx,cx)`? Indeed, the null hypothesis should be that the two samples come from the same distribution, which is obviously true if I compare a distribution `cx` with itself.
But then, if we consider `crosstab(cx,cx)` with the entire distribution `cx`, we see that the p-value goes to zero. A non-zero p-value is obtained only if we consider parts of the `cx` distribution, i.e. by using `crosstab(cx(1:nbc),cx(1:nbc))`.

Solution

**Understanding crosstab and P-values**

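When you call `crosstab(cx, cy)`, the intent is to compare the observed frequencies in the bins of `x` and `y` against each other. Note, however, that `crosstab` treats its inputs as grouping variables: each element of `cx` and `cy` is read as a category label for one observation, not as a count.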

A chi-square test of independence is performed to determine if there is a significant association between the frequencies of `x` and `y` across the bins.
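For reference, this is the standard Pearson statistic behind a test of independence, written for an $R \times C$ table of observed counts $O_{ij}$ with total $n$. The symbols are introduced here for illustration and do not appear in the original post:

$$
E_{ij} = \frac{\left(\sum_{j'} O_{ij'}\right)\left(\sum_{i'} O_{i'j}\right)}{n},
\qquad
\chi^2 = \sum_{i=1}^{R}\sum_{j=1}^{C} \frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}},
\qquad
\mathrm{df} = (R-1)(C-1)
$$

The p-value is the upper tail probability of a chi-square distribution with $\mathrm{df}$ degrees of freedom evaluated at $\chi^2$. If the two rows of a $2 \times C$ table are identical, every $O_{ij}$ equals its $E_{ij}$, so $\chi^2 = 0$ and the p-value is 1.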

When you increase the number of bins `nbc`, the contingency table becomes sparse (many cells with small expected counts). In that regime the chi-square approximation becomes unreliable and sensitive to small deviations, which can lead to low p-values.

Additionally, comparing `cx` with itself should, in principle, produce a chi-square statistic equal to 0, because a set of counts cannot differ from itself. With `crosstab(cx, cx)` this won't happen, because of how `crosstab` handles its inputs: it cross-tabulates the values in `cx` as category labels, so the resulting table is diagonal (perfectly associated) and the p-value is small rather than 1.
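A minimal sketch of how `crosstab` reads a vector of counts, using a small made-up vector `c` (the values are purely illustrative and are not taken from the question's data):

```matlab
% Hypothetical bin counts, chosen only for illustration
c = [5 3 3 0 0];

% crosstab treats every element of c as a category label, so the table
% has one row and one column per distinct value (0, 3 and 5), not one
% per bin
[tbl, chi2, p] = crosstab(c, c);

disp(tbl)   % diagonal: each label co-occurs only with itself
fprintf('chi2 = %.4g, p = %.4g\n', chi2, p)
```

For a diagonal table with n observations and k distinct labels, the Pearson statistic works out to n(k-1), here 5*2 = 10. Adding more bins that repeat a value already seen (typically the zeros in the tail) increases n faster than k, which would explain why the p-value drifts towards zero as `nbc` grows.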

**Answer**

It's not strictly necessary to use only parts of `cx` and `cy`, but doing so makes the table less sparse and generally gives more reliable p-values.

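As for whether comparing `cx` with itself should yield a p-value of 1: short answer, yes, provided the counts enter the test as a contingency table (one row per dataset, one column per bin) rather than as the grouping variables that `crosstab(cx, cx)` takes them to be.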
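One way to make `crosstab` compare binned counts, rather than treat them as category labels, is to expand the counts back into one (dataset, bin) pair per observation. The sketch below assumes both count vectors were computed with the same bin edges (for example by passing one explicit edges vector to `histcounts` for both datasets). The counts and the names `c1`, `c2`, `dataset` and `bin` are hypothetical.

```matlab
% Hypothetical bin counts for two datasets sharing the same bin edges
cx = [40 25 12 6 3];
cy = [35 28 15 5 2];

c1 = cx;
c2 = cx;            % self-comparison; set c2 = cy to compare x with y
nb = numel(c1);

counts  = [c1, c2];                        % one count per (dataset, bin) cell
dataset = [ones(1, nb), 2*ones(1, nb)];    % 1 = first dataset, 2 = second
bin     = [1:nb, 1:nb];                    % bin index of each cell

% Expand to one label pair per observation, skipping empty cells
keep   = counts > 0;
dsObs  = repelem(dataset(keep), counts(keep));
binObs = repelem(bin(keep), counts(keep));

% Rows of tbl are datasets, columns are bins
[tbl, chi2, p] = crosstab(dsObs, binObs);
disp(tbl)
fprintf('chi2 = %.4g, p = %.4g\n', chi2, p)
```

With `c2 = cx` the two rows of `tbl` are identical, so the statistic is zero (up to rounding) and the p-value is 1, which is the behaviour the question expected.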

What I would recommend is to:

- Check whether the expected frequencies are too low. If they are, combine bins to increase the counts in each bin.
- Test different bin counts and make sure the bins are well populated.
- Use different testing methods: if you encounter problems with chi-square tests, consider exact tests such as Fisher's exact test.
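The first recommendation can be sketched as follows. This is only an illustration under stated assumptions: the counts are made up, the threshold of 5 expected observations per cell is a common rule of thumb rather than a hard rule, and the loop merges bins from the right-hand tail only, which assumes that is where the sparse bins are.

```matlab
% Hypothetical counts with a long, sparse right tail (same bin edges assumed)
cx = [40 25 12 6 3 1 0 1 0 0];
cy = [35 28 15 5 2 2 1 0 0 1];
minExpected = 5;    % rule-of-thumb threshold (assumption)

obs = [cx; cy];     % 2-by-nb table: one row per dataset, one column per bin

% Merge the last bin into its left neighbour until every expected
% count reaches the threshold (or only two bins remain)
while size(obs, 2) > 2
    expd = sum(obs, 2) * sum(obs, 1) / sum(obs(:));   % row total * column total / n
    if min(expd(:)) >= minExpected
        break
    end
    obs(:, end-1) = obs(:, end-1) + obs(:, end);
    obs(:, end)   = [];
end

disp(obs)           % merged table, ready for a chi-square test on counts
```

Merging adjacent bins keeps the total count in each row unchanged, so the comparison is still between the same two samples, just at a coarser resolution in the tail.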
