rr-factor

Obtain counts within 3 variable factors


I have a sensitive dataset, so I created a mock one here to illustrate.

df <- data.frame(
  Year = c("2010", "2010", "2010", "2011", "2011", "2012", "2013", "2013", "2013"),
  Race = c("White", "White", "Asian", "White", "Black", "Black", "Unknown", "Unknown", "White"),
  Ethnicity = c("Hispanic", "Hispanic", "Not Hispanic", "Hispanic", "Not Hispanic", "Not Hispanic", "Unknown", "Hispanic", "Not Hispanic")
)

  Year    Race    Ethnicity
1 2010   White     Hispanic
2 2010   White     Hispanic
3 2010   Asian Not Hispanic
4 2011   White     Hispanic
5 2011   Black Not Hispanic
6 2012   Black Not Hispanic
7 2013 Unknown      Unknown
8 2013 Unknown     Hispanic
9 2013   White Not Hispanic

In reality, my dataset covers 2010-2021, so 12 years total. There are also 6 or 7 racial categories and 3 possible values for ethnicity (Hispanic/Latino, Not Hispanic/Latino, Unknown).

I am trying to obtain counts for each combination of year, race, and ethnicity (for example, 2010 White Hispanic, 2010 White non-Hispanic, 2010 Asian Hispanic, 2010 Asian non-Hispanic, etc.). I am currently using this function to pull the counts:

library(dplyr)

# Count the rows matching one race (x), ethnicity (y), and year (z)
raceethfunc <- function(x, y, z) {
  df %>%
    filter(Race == x & Ethnicity == y & Year == z) %>%
    nrow()
}

H_white2010 <- raceethfunc(x = "White", y = "Hispanic or Latino", z = "2010")
H_white2011 <- raceethfunc(x = "White", y = "Hispanic or Latino", z = "2011")
H_white2012 <- raceethfunc(x = "White", y = "Hispanic or Latino", z = "2012")

Etc...

I have to do this for each year, race, and ethnicity, which means copying and pasting 200+ lines of code just to change the year in one line or the race in another. It is a very inefficient way of going about it.

I am new to coding, and to functions especially. I tried using a for() loop but could not work out how to get it to run. Any guidance on a loop or a more efficient approach would be greatly appreciated.

PS: This is my first post here, so if I am doing something incorrectly, please let me know how I can improve my future posts!


Solution

  • Use group_by() and count() from the {dplyr} package to count every Year/Race/Ethnicity combination in one pass:

    df <- data.frame(
      Year = c("2010", "2010", "2010", "2011", "2011", "2012", "2013", "2013", "2013"),
      Race = c("White", "White", "Asian", "White", "Black", "Black", "Unknown", "Unknown", "White"),
      Ethnicity = c("Hispanic", "Hispanic", "Not Hispanic", "Hispanic", "Not Hispanic", "Not Hispanic", "Unknown", "Hispanic", "Not Hispanic")
    )
    
    df |>
      dplyr::group_by(Year, Race, Ethnicity) |>
      dplyr::count()
    #> # A tibble: 8 × 4
    #> # Groups:   Year, Race, Ethnicity [8]
    #>   Year  Race    Ethnicity        n
    #>   <chr> <chr>   <chr>        <int>
    #> 1 2010  Asian   Not Hispanic     1
    #> 2 2010  White   Hispanic         2
    #> 3 2011  Black   Not Hispanic     1
    #> 4 2011  White   Hispanic         1
    #> 5 2012  Black   Not Hispanic     1
    #> 6 2013  Unknown Hispanic         1
    #> 7 2013  Unknown Unknown          1
    #> 8 2013  White   Not Hispanic     1
    

    Created on 2023-06-30 with reprex v2.0.2
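
As a possible follow-up, here is a minimal sketch of two variations on the approach above. It assumes the same mock `df`. Passing the columns directly to `count()` should give the same counts without a separate `group_by()` call, and it returns an ungrouped result. The use of `complete()` from {tidyr} is an addition not in the original answer: it is one way to also list combinations that never occur, with a count of 0. The names `counts`, `all_counts`, and `h_white_2010` are made up for illustration.

```r
library(dplyr)
library(tidyr)  # assumption: only needed for complete(), not used in the answer above

df <- data.frame(
  Year = c("2010", "2010", "2010", "2011", "2011", "2012", "2013", "2013", "2013"),
  Race = c("White", "White", "Asian", "White", "Black", "Black", "Unknown", "Unknown", "White"),
  Ethnicity = c("Hispanic", "Hispanic", "Not Hispanic", "Hispanic", "Not Hispanic", "Not Hispanic", "Unknown", "Hispanic", "Not Hispanic")
)

# count() groups by the listed columns, tallies the rows, and ungroups again
counts <- df |>
  count(Year, Race, Ethnicity)

# complete() adds a row for every Year x Race x Ethnicity combination
# that does not appear in the data, filling its count with 0
all_counts <- counts |>
  complete(Year, Race, Ethnicity, fill = list(n = 0L))

# Look up a single count from the table instead of storing
# one variable per combination
h_white_2010 <- counts |>
  filter(Year == "2010", Race == "White", Ethnicity == "Hispanic") |>
  pull(n)
```

Keeping all counts in one table, rather than in 200+ separate variables such as `H_white2010`, means any individual value can still be pulled out with `filter()` when needed.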