A version of this question has been asked a few times, but never in the simplest way. Basically, the stats::chisq.test function doesn't work when the sample sizes of the two groups are unequal, even though, as I understand it, chi-square tests are supposed to handle unequal sample sizes.
Here is some test data:
df1 <- data.frame("x" = c("Yes","No","Yes","No","Yes","No","Yes","No","Yes","No","Yes","No","Yes","No","Yes","No"))
df2 <- data.frame("x" = c("Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","No"))
My goal is to see whether the outcome x (i.e., "Yes" or "No") differs between two groups of unequal sample size. But when I run the following code:
chisq.test(table(df1$x,df2$x))
I get the following error:
Error in table(df1$x, df2$x) : all arguments must have the same length
Is there a simple fix for this besides creating a new dataframe that has equal sample sizes by adding NAs to the shorter df? Why does this error even exist if chi-square tests can run with unequal sample sizes in the groups being compared?
Ok, so this is a pretty elementary statistical issue, but it took a lot of effort for me to figure out, and I think other people might get similarly confused. It's also quite a fraught issue because it affects how you interpret your data (the p-values are wrong if you set this up incorrectly!), so it's worth wrapping your head around.
Imagine you have a dataset like this:
df <- data.frame(group1 = c(rep("hot",9),"cold"),
group2 = c(rep("hot",5),rep("cold",5)))
> df
group1 group2
1 hot hot
2 hot hot
3 hot hot
4 hot hot
5 hot hot
6 hot cold
7 hot cold
8 hot cold
9 hot cold
10 cold cold
You're interested in whether being in group1 or group2 is associated with being hot or cold. If you're like me, you might assume you can do a chi-square test comparing the two groups with:
m <- chisq.test(df$group1, df$group2)
m
Resulting in:
Pearson's Chi-squared test with Yates' continuity correction
data: df$group1 and df$group2
X-squared = 0, df = 1, p-value = 1
Those statistics are computed correctly, but for the wrong table, so they don't answer your question. The reason is the structure of your data. When you pass two columns, R treats each row as one observation measured on two variables and cross-tabulates them: it counts the rows that are hot in both group1 and group2, hot in group1 but cold in group2, and so on. That pairing is meaningless here, because row 1 of group1 has nothing to do with row 1 of group2. What you want is to compare the proportion of hot in group1 with the proportion of hot in group2. You can see the problem by looking at the observed frequency table that the chi-square test bases its analysis on:
m$observed
         df$group2
df$group1 cold hot
     cold    1   0
     hot     4   5
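To make the pairing explicit, here is a minimal sketch showing that passing two vectors to chisq.test gives the same cross-tabulation as calling table() on them directly. The name `paired` is mine, and the warning is suppressed only because the expected counts in this toy table are small.

```r
# Same toy data as above
df <- data.frame(group1 = c(rep("hot", 9), "cold"),
                 group2 = c(rep("hot", 5), rep("cold", 5)))

# table() pairs the two columns row by row
paired <- table(df$group1, df$group2)

# chisq.test() on two vectors builds its table the same way
# (warning about small expected counts suppressed for this toy example)
m <- suppressWarnings(chisq.test(df$group1, df$group2))

# The counts are identical cell for cell
same_counts <- all(paired == m$observed)
print(same_counts)
```

This is also why table(df1$x, df2$x) fails with vectors of different lengths: there is no way to pair row 16 of one with a row that doesn't exist in the other.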
To answer the question you're actually interested in ("is there an association between group and temperature"), you need to change the structure of the data you are calling in the chi-square function:
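library(dplyr)  # %>% and arrange()
library(tidyr)  # pivot_longer()

# Stack the two group columns into one "group" column and one "temperature" column
df2 <- df %>%
  pivot_longer(cols = c("group1", "group2"),
               names_to = "group",
               values_to = "temperature") %>%
  arrange(group)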
df2
# A tibble: 20 × 2
group temperature
<chr> <chr>
1 group1 hot
2 group1 hot
3 group1 hot
4 group1 hot
5 group1 hot
6 group1 hot
7 group1 hot
8 group1 hot
9 group1 hot
10 group1 cold
11 group2 hot
12 group2 hot
13 group2 hot
14 group2 hot
15 group2 hot
16 group2 cold
17 group2 cold
18 group2 cold
19 group2 cold
20 group2 cold
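If you'd rather not load the tidyverse, the same long layout can be sketched in base R. This is just one way to do it, and the names `long` and `tab` are mine, not from the code above.

```r
# Same toy data as above
df <- data.frame(group1 = c(rep("hot", 9), "cold"),
                 group2 = c(rep("hot", 5), rep("cold", 5)))

# One row per observation: which group it came from, and its temperature
long <- data.frame(
  group       = rep(names(df), each = nrow(df)),
  temperature = as.character(unlist(df, use.names = FALSE))
)

# Cross-tabulate outcome by group (rows: temperature, columns: group)
tab <- table(long$temperature, long$group)
print(tab)
```

unlist() concatenates the columns in order, so the first 10 values belong to group1 and the next 10 to group2, which matches the labels built with rep().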
Now we can call the chi-square function correctly, and we see that the observed frequency table matches the counts of hot and cold in each group:
> p <- chisq.test(df2$temperature, df2$group)
> p
Pearson's Chi-squared test with Yates' continuity correction
data: df2$temperature and df2$group
X-squared = 2.1429, df = 1, p-value = 0.1432
> p$observed
               df2$group
df2$temperature group1 group2
           cold      1      5
           hot       9      5
Of course, you don't actually have to reformat your data like this to do the chi-square test. Instead, you can use the helpful code from the other answers above to create a frequency table that has the values you're interested in. But for me at least, it was helpful to write all this out to see what is actually being tested. In general, if R throws an error about arguments of different lengths when you run a chi-square test, you have probably passed each group as a separate argument rather than passing one grouping variable and one outcome variable (or a table of counts).
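Applied to the data in the question, the same idea might look like the sketch below. It assumes the two data frames hold independent groups; the names `long`, `tab` and `res` are mine, and df2 is written with rep() but contains the same 15 values in the same order as in the question.

```r
# Data from the question: 16 observations in df1, 15 in df2
df1 <- data.frame(x = rep(c("Yes", "No"), times = 8))
df2 <- data.frame(x = c(rep("Yes", 8), "No", "Yes", "No",
                        "Yes", "Yes", "Yes", "No"))

# Stack the outcomes and record which group each one came from;
# unequal group sizes are no problem in this layout
long <- data.frame(
  group = rep(c("df1", "df2"), times = c(nrow(df1), nrow(df2))),
  x     = c(as.character(df1$x), as.character(df2$x))
)

# Group-by-outcome frequency table, then the test on that table
tab <- table(long$group, long$x)
res <- chisq.test(tab)

print(tab)
print(res)
```

No padding with NAs is needed: the test only sees the four counts in the table, not the lengths of the original vectors.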