r, statistics, chi-squared, sample-size

Chi-square test in R with unequal sample sizes


A version of this question has been asked a few times, but never in the simplest way. Basically, the stats::chisq.test function doesn't seem to work when the two groups have different sample sizes, even though, from what I understand, chi-square tests are supposed to handle unequal sample sizes.

Here is some test data:

df1 <- data.frame("x" = c("Yes","No","Yes","No","Yes","No","Yes","No","Yes","No","Yes","No","Yes","No","Yes","No"))
df2 <- data.frame("x" = c("Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","No"))

My goal is to see whether there is a difference in the outcome x (i.e., is the outcome "yes" or "no") between two groups of unequal sample size. But when I run the following code:

chisq.test(table(df1$x,df2$x))

I get the following error:

Error in table(df1$x, df2$x) : all arguments must have the same length

Is there a simple fix for this besides creating a new dataframe that has equal sample sizes by adding NAs to the shorter df? Why does this error even exist if chi-square tests can run with unequal sample sizes in the groups being compared?


Solution

  • Ok, so this is a pretty elementary statistical issue, but it took me a lot of effort to figure out, and I think other people might get confused in the same way. It also matters for how you interpret your data: if you set the test up incorrectly, the p-value answers a different question from the one you meant to ask. So it's important to wrap your head around.

    Imagine you have a dataset like this:

    df <- data.frame(group1 = c(rep("hot",9),"cold"),
                     group2 = c(rep("hot",5),rep("cold",5)))
    > df
       group1 group2
    1     hot    hot
    2     hot    hot
    3     hot    hot
    4     hot    hot
    5     hot    hot
    6     hot   cold
    7     hot   cold
    8     hot   cold
    9     hot   cold
    10   cold   cold
    

    You're interested in whether being in group1 or group2 is associated with being hot or cold. If you're like me, you might assume you can do a chi-square test comparing the two groups with:

    m <- chisq.test(df$group1, df$group2)
    m
    

    Resulting in:

        Pearson's Chi-squared test with Yates' continuity correction
    
    data:  df$group1 and df$group2
    X-squared = 0, df = 1, p-value = 1
    

    Those statistics don't answer the question you meant to ask. The reason is the structure of your data. Given two vectors, chisq.test cross-tabulates them element by element, so each row of df is treated as one observation measured on two variables. Rather than comparing the proportion of "hot" in group1 with the proportion of "hot" in group2, R is testing whether the group1 value in a row is associated with the group2 value in the same row, an analysis that doesn't make sense given your question. You can see this by calling the observed frequency table that the chi-square test is basing the analysis on:

    m$observed
             df$group2
    df$group1 cold hot
         cold    1   0
         hot     4   5
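
    If you're curious where X-squared = 0 comes from, here is a rough sketch that redoes the arithmetic by hand for this 2x2 table. The names obs, expd, adj and stat are my own, and I'm assuming the usual form of the Yates correction (shrink each |observed - expected| by 0.5, never below zero), which matches what chisq.test reports here:

    ```r
    # Same data as above
    df <- data.frame(group1 = c(rep("hot", 9), "cold"),
                     group2 = c(rep("hot", 5), rep("cold", 5)))

    # Cross-tabulation that chisq.test(df$group1, df$group2) is based on
    obs <- table(df$group1, df$group2)

    # Expected counts under independence: row total * column total / n
    expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)

    # Yates correction: reduce each |O - E| by 0.5, but not below zero
    adj <- pmax(abs(obs - expd) - 0.5, 0)

    # Corrected chi-square statistic
    stat <- sum(adj^2 / expd)
    stat
    ```

    Every cell is exactly 0.5 away from its expected count, so the correction wipes out the whole difference. That is why the test reports no association at all in this layout.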
    

    To answer the question you're actually interested in ("is there an association between group and temperature"), you need to change the structure of the data you pass to the chi-square function:

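    library(dplyr)  # %>% and arrange()
    library(tidyr)  # pivot_longer()

    # Reshape to one row per observation: which group it came from, and its value
    df2 <- df %>% 
      pivot_longer(cols = c("group1","group2"),
                  names_to = "group",
                  values_to = "temperature") %>% 
      arrange(group)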
    df2
    # A tibble: 20 × 2
       group  temperature
       <chr>  <chr>      
     1 group1 hot        
     2 group1 hot        
     3 group1 hot        
     4 group1 hot        
     5 group1 hot        
     6 group1 hot        
     7 group1 hot        
     8 group1 hot        
     9 group1 hot        
    10 group1 cold       
    11 group2 hot        
    12 group2 hot        
    13 group2 hot        
    14 group2 hot        
    15 group2 hot        
    16 group2 cold       
    17 group2 cold       
    18 group2 cold       
    19 group2 cold       
    20 group2 cold      
    

    Now we can call the chi-square function correctly, and we see that the observed frequencies are what we expected:

    > p <- chisq.test(df2$temperature, df2$group)
    > p
    
        Pearson's Chi-squared test with Yates' continuity correction
    
    data:  df2$temperature and df2$group
    X-squared = 2.1429, df = 1, p-value = 0.1432
    
    > p$observed
                   df2$group
    df2$temperature group1 group2
               cold      1      5
               hot       9      5
    

    Of course, you don't actually have to reformat your data like this to do the chi-square test. Instead, you can use the helpful code from the other answers above to create a frequency table that has the values you're interested in. But for me at least it was helpful to write all this out to see what you're actually testing. In general, if you're running chi-square tests and R is throwing errors about arguments having different lengths, you have probably set up the call incorrectly: the two things you pass in should be the group and the outcome for each observation, not one column of outcomes per group.
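
    For the data in the question, where the two groups really do have different sizes (16 and 15), the same idea can be sketched in base R without any reshaping packages. This is only a minimal sketch: the names outcome, group, tab and res are my own, and I'm assuming the x columns hold plain "Yes"/"No" values as in the question.

    ```r
    # Data from the question: two groups of unequal size
    df1 <- data.frame(x = rep(c("Yes", "No"), times = 8))
    df2 <- data.frame(x = c(rep("Yes", 8), "No", "Yes", "No", "Yes", "Yes", "Yes", "No"))

    # Stack the outcomes into one vector (as.character guards against factor columns)
    outcome <- c(as.character(df1$x), as.character(df2$x))

    # Label each outcome with the group it came from
    group <- rep(c("df1", "df2"), times = c(nrow(df1), nrow(df2)))

    # group x outcome contingency table: df1 has 8 No / 8 Yes, df2 has 3 No / 12 Yes
    tab <- table(group, outcome)
    tab

    # Test for an association between group and outcome
    res <- chisq.test(tab)
    res
    ```

    Both vectors passed to table() have 31 elements, one per observation, so the "all arguments must have the same length" error never comes up, even though the groups themselves are unequal in size.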