rdplyr

Summary statistics for different levels of group_by()


I'm creating summary tables using group_by() %>% summarise(). My data is in a multi-level format, persons located within households. The goal is to group households by whether they have any animals, and look at the diarrhea prevalence within those households. Here is the data and code that I have so far:

library(tidyverse)

# Create person level data
person_level <- tibble::tribble(
                 ~household_id, ~person_id, ~age, ~diarrhea, ~animal, ~animal_total,
                 1,          1,    2,         1,       1,             5,
                 1,          2,    3,         0,       1,             5,
                 1,          3,   65,        NA,       1,             5,
                 1,          4,   66,        NA,       1,             5,
                 2,          1,    2,         1,       0,             0,
                 2,          2,    4,         0,       0,             0,
                 2,          3,    8,        NA,       0,             0,
                 2,          4,   55,        NA,       0,             0
               )

# "convert" to household level
# Used to get # of animals per household
# Person level shows household 1 has 20 animals
# While household 1 really only has 5 animals
household_level <- person_level %>% 
                    distinct(household_id, .keep_all = TRUE) 

# Create summary table
person_level %>% 
  group_by(animal) %>% 
  summarise(n = n(),
            animal_total = household_level %>% pull(animal_total) %>% sum,
            n_under5 = sum(age <= 5, na.rm = TRUE),
            num_diarrhea = sum(diarrhea, na.rm = TRUE),
            percent_diarrhea = round((num_diarrhea / n_under5), 4)*100) %>% 
  arrange(desc(animal))

I "convert" the person level data into household level to get the number of animals that each household has. If we keep it in the person level, it will show the household has 20 animals (5 animals * 4 people), whereas the household really only has 5 animals.

Within my summary table, I'm able to get the total number of animals among all households that own animals, but this number also shows for the group of people where animal = 0.

Goal: I want the "animal_total" within summarise() to only show for one group, and not the other. In other words, have the sum(animal) show only for the group who owns animals, and have a 0 show for those people that don't have animals in the household.


Solution

  • I wouldn't bother with a separate household_level table. Instead you could mutate things so that just the first member of the household has a value

    person_level %>% 
      group_by(household_id) %>% 
      mutate(first_animal_total = if_else(row_number()==1, animal_total, 0 )) %>% 
      group_by(animal) %>% 
      summarise(n = n(),
                animal_total = sum(first_animal_total),
                n_under5 = sum(age <= 5, na.rm = TRUE),
                num_diarrhea = sum(diarrhea, na.rm = TRUE),
                percent_diarrhea = round((num_diarrhea / n_under5), 4)*100) %>% 
      arrange(desc(animal))
    

    This returns

      animal     n animal_total n_under5 num_diarrhea percent_diarrhea
       <dbl> <int>        <dbl>    <int>        <dbl>            <dbl>
    1      1     4            5        2            1               50
    2      0     4            0        2            1               50