rtidyversesummarize

How to force 0-count summary cells and add totals to columns and rows (R tidyverse)


Okay, here is a sample of what my data looks like:

GRADE_LVL COURSE_NAME COURSE_CODE STUDENT_GENDER ETHNICITY OUTCOME
12 Physics 03165 Male White Pass
12 Physics 03165 Female White Pass
12 Physics 03165 Nonbinary Black or African American Pass
9 Algebra I 02052 Female Multiracial Pass
10 Algebra I 02052 Female White Fail

I need to report on 3 genders (male, female, nonbinary), and 7 ethnicities (Hispanic or Latino, American Indian or Alaska Native, Asian, Native Hawaiian/Other Pacific Islander, Black or African American, White, and Multiracial).

I am trying to write a function in R that will produce a table of the demographic facts for a set of parameters passed to it. I want the output of the function to produce a tibble that looks like this :

High School Students Who Passed Algebra I

Hispanic Native American Asian Black White Multiracial Total
Male 0 0 7 2 13 4 26
Female 1 0 3 1 12 3 20
Nonbinary 0 0 0 0 1 0 1
Total 1 0 10 3 26 7 47

Notes: This contains only dummy values that aren't related to those in my sample data snippet above. I have shortened the column names to save space on the screen.

Here is the code I have so far:

data <- dbGetQuery('secrets')
highSchool = c('09', '10', '11', '12')
passingOnly <- quo(OUTCOME == 'Pass')
algebra1 <- quo(COURSE_CODE == '02052')
ethnicCategories <- factor(c(
                              'Hispanic or Latino', 
                              'American Indian or Alaska Native',
                              'Asian',
                              'Native Hawaiian/Other Pacific Islander',
                              'Black or African American',
                              'White',
                              'Multiracial'
                            ))
genderCategories <- factor(c('Female', 'Male', 'Nonbinary'))

demographicBreakout <- function(filterConditions, gradeLevels) {
      data %>%
      filter( {{ filterConditions }} ) %>%
      filter(GRADE_LVL %in% gradeLevels) %>%
      select(STUDENT_GENDER, ETHNIC_DESC) %>%
      group_by(STUDENT_GENDER, ETHNIC_DESC) %>%
      summarise(COUNT = n()) %>%
      pivot_wider(
        names_from = ETHNIC_DESC, 
        values_from = COUNT, 
        values_fill = 0
      ) %>%
      rename_at("STUDENT_GENDER", ~"Gender")
}

report <- demographicBreakout(
          filterConditions = !!quo(!!algebra1 & !!passingOnly),
          gradeLevels = highSchool
      )

This code produces a tibble that looks like this:

Gender Hispanic Asian Black White Multiracial
Female 2 15 2 26 9
Male 12 23 1 43 11

This is looking good so far, but I need to have all demographic categories present in the table even if the count is 0. I tried adding the following code snippet to my demographicBreakout function in between the summarize and pivot_wider statements:

      complete(
        ETHNIC_DESC = ethnicCategories,
        STUDENT_GENDER = genderCategories, 
        fill = list(COUNT = 0)
      ) %>%

Adding this code caused the following error:

Error in `reframe()`:
ℹ In argument: `complete(data = pick(everything()), ..., fill = fill, explicit = explicit)`.
ℹ In group 1: `STUDENT_GENDER = Female`.
Caused by error in `dplyr::full_join()`:
! Join columns in `y` must be present in the data.
✖ Problem with `STUDENT_GENDER`.

I have been unable to get this error resolved. In addition to this complete() statement, I also need to run something similar on the rows so that nonbinary counts appear. On top of all that, I still need to add in the row and column totals.

Any help getting over my current hurdle would be greatly appreciated.


Solution

  • The function can be simplified a lot, when we combine the factor conversion idea from Stefan's (now-deleted) answer, and the .drop = FALSE argument for the count() function:

    ethnicCategories <- c('Hispanic or Latino', 
                        'American Indian or Alaska Native',
                        'Asian',
                        'Native Hawaiian/Other Pacific Islander',
                        'Black or African American',
                        'White',
                        'Multiracial')
    genderCategories <- c('Female', 'Male', 'Nonbinary')
    
    demographicBreakout <- function(filterConditions, gradeLevels) {
      data |>
        filter({{filterConditions}} & GRADE_LVL %in% gradeLevels) |>
        mutate(Gender = factor(STUDENT_GENDER, levels = genderCategories),
                e = factor(ETHNIC_DESC, levels = ethnicCategories)) |>
        count(Gender, e, .drop = FALSE) |>
        pivot_wider(names_from = e, values_from = n)
    }
    

    Example:

    demographicBreakout(
              filterConditions = !!quo(COURSE_CODE == '02052'),
              gradeLevels = c('09', '10', '11', '12')
          ) |> print(width = Inf)
    

    Output:

    # A tibble: 3 × 8
      Gender    `Hispanic or Latino` `American Indian or Alaska Native` Asian
      <fct>                    <int>                              <int> <int>
    1 Female                       0                                  0     0
    2 Male                         0                                  0     0
    3 Nonbinary                    0                                  0     0
      `Native Hawaiian/Other Pacific Islander` `Black or African American` White
                                         <int>                       <int> <int>
    1                                        0                           0     1
    2                                        0                           0     0
    3                                        0                           0     0
      Multiracial
            <int>
    1           0
    2           0
    3           0