r, pivot-table

How to use a data frame containing counts to run contingency analysis


I know this is a simple question, but I cannot find the answer, which I'm sure is simple.

I have an 8 × 3 data frame: column 1 is age and column 2 is depression, both defined as factors. Column 3 is freq, which is numeric and holds the frequency count. Here is what the data frame looks like: [image: data frame]

I want to run a chi-square analysis of age by depression, but I have not been able to figure out how to tell chisq.test or CrossTable that the freq variable holds the frequency count for the corresponding cell.

For those who are familiar with SAS, what I want to do corresponds to specifying the WEIGHT variable in PROC FREQ.

I tried the following:

CrossTable(tabledata$age,tabledata$depression)

chisq.test(tabledata$age,tabledata$depression)

I knew those would not work, as I could not figure out how to include tabledata$freq as a count variable.


Solution

  • You need to get your data in a format suitable for the chisq.test function, which accepts either a matrix-like object or two vectors of the same length.

    One way to do this is to pivot the data using tidyr:

    X <- tidyr::pivot_wider(data, names_from=depression, values_from=freq)
    X
    # A tibble: 4 × 3
        age   `1`   `0`
      <dbl> <dbl> <dbl>
    1    25  1108  5234
    2    29  2086 16824
    3    34  2056 21608
    4    35  1353 15000
    
    (XSQ <- chisq.test(X[,2:3]))
    
        Pearson's Chi-squared test
    
    data:  X[, 2:3]
    X-squared = 508.78, df = 3, p-value < 2.2e-16
    
    XSQ$residuals
                 1         0
    [1,] 18.413377 -6.177473
    [2,]  3.954237 -1.326600
    [3,] -6.907864  2.317508
    [4,] -7.409337  2.485746
    

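    This indicates a significant association between depression and age. The residuals show where it comes from: the two youngest groups (25 and 29) have more depression cases than expected under independence, with the largest excess at age 25, while the two older groups (34 and 35) have fewer.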


    Data:

    data <- data.frame(age=c(25,25,29,29,34,34,35,35),
                       depression=c(1,0,1,0,1,0,1,0),
                       freq=c(1108,5234,2086,16824,2056,21608,1353,15000))
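
As a possible base-R alternative that is closer to the WEIGHT statement in PROC FREQ, the sketch below builds the contingency table with xtabs, putting freq on the left-hand side of the formula so that it is summed within each age/depression cell. It also shows the "two vectors of the same length" route by repeating each row freq times. The names tab, expanded, res_tab and res_vec are only illustrative. Both calls should give the same X-squared as the pivot_wider version above, since only the column order of the table differs.

```r
# Same counts as in the Data block of the answer
data <- data.frame(age = c(25, 25, 29, 29, 34, 34, 35, 35),
                   depression = c(1, 0, 1, 0, 1, 0, 1, 0),
                   freq = c(1108, 5234, 2086, 16824, 2056, 21608, 1353, 15000))

# freq on the left-hand side acts as the weight: counts are summed per cell
tab <- xtabs(freq ~ age + depression, data = data)
print(tab)

# chisq.test accepts the 4 x 2 table directly
res_tab <- chisq.test(tab)
print(res_tab)

# Row proportions: share of each depression level within each age group
print(prop.table(tab, margin = 1))

# Alternative: expand to one row per observation, then use the two-vector form
expanded <- data[rep(seq_len(nrow(data)), times = data$freq), ]
res_vec <- chisq.test(expanded$age, expanded$depression)
print(res_vec)
```

The xtabs route avoids materialising one row per observation, which matters little here but can with larger counts. The expanded form is mainly useful for functions that only accept raw vectors.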