rnested

pivot(), group_by() and summarise() with nested data


Mock data:

df <- structure(list(stop1 = c("New York", "Milwaukee", "New York",
                               "Los Angeles", NA, "Milwaukee"),
                     stop2 = c(NA, "New York", "Los Angeles", 
                               "New York", NA, "New York"),
                     stop1_apple = c("apple", "apple", NA, "apple", NA, "apple"),
                     stop1_pear = c("pear", "pear", "pear", NA, NA, "pear"),
                     stop2_apple = c(NA, "apple", "apple", NA, NA, "apple"), 
                     stop2_pear = c(NA, "pear", "pear", "pear", NA, NA)),
                class = "data.frame", row.names = c(NA, -6L))

df
        stop1       stop2 stop1_apple stop1_pear stop2_apple stop2_pear
1    New York        <NA>       apple       pear        <NA>       <NA>
2   Milwaukee    New York       apple       pear       apple       pear
3    New York Los Angeles        <NA>       pear       apple       pear
4 Los Angeles    New York       apple       <NA>        <NA>       pear
5        <NA>        <NA>        <NA>       <NA>        <NA>       <NA>
6   Milwaukee    New York       apple       pear       apple       <NA>

How to read:

Each row is a traveler. For example, the third row is a traveler who stopped in New York and Los Angeles. During the first stop in New York, she ate a pear. During the second stop in Los Angeles, she ate an apple and a pear.

What I want to do:

Firstly, I want to compute the total number of people who transited in each city. For example, here, 5 people transited in New York (2 at stop1, 3 at stop2) in total. Secondly, I want to compute the number of apples and pears eaten in each city. For example, in New York, 3 apples and 4 pears were eaten.

Desired Output:
df
Location     NStops  NApples   NPears
Los Angeles  2       2         1
Milwaukee    2       2         2
New York     5       3         4 

What I tried:

library(tidyverse)
df %>% pivot_longer(c(stop1, stop2)) %>% 
  rename(Location = value, Stop = name) %>% 
  add_count(Location, name = "NStops") %>% 
  pivot_longer(c(stop1_apple, stop1_pear, stop2_apple, stop2_pear)) %>% 
  group_by(Location, value, NStops) %>% 
  summarise(NFruits = n()) %>% 
  pivot_wider(names_from = value, values_from = NFruits) %>% 
  rename(NApples = apple, NPears = pear) %>% 
  select(-`NA`) %>% 
  filter(!is.na(Location))
  

Output:
Location    NStops NApples NPears
Los Angeles      2       2      3
Milwaukee        2       4      3
New York         5       7      7

The number of stops in each city is correct, but the number of fruits eaten is not as expected, see desired output above. Also note, in reality, I have more than 2 stops, more than 2 fruits, and dozens of cities. Finally, if possible, I would prefer a tidyverse solution.

UPDATE: Alternative data following the-mad-statter's answer

df <- structure(list(Q57 = c("New York", "Milwaukee", "New York",
                               "Los Angeles", NA, "Milwaukee"),
                     Q60 = c(NA, "New York", "Los Angeles", 
                               "New York", NA, "New York"),
                     `Q58/apple` = c("apple", "apple", NA, "apple", NA, "apple"),
                     `Q58/pear` = c("pear", "pear", "pear", NA, NA, "pear"),
                     `Q61/apple` = c(NA, "apple", "apple", NA, NA, "apple"), 
                     `Q61/pear` = c(NA, "pear", "pear", "pear", NA, NA)),
                class = "data.frame", row.names = c(NA, -6L))


df
          Q57         Q60 Q58/apple Q58/pear Q61/apple Q61/pear
1    New York        <NA>     apple     pear      <NA>     <NA>
2   Milwaukee    New York     apple     pear     apple     pear
3    New York Los Angeles      <NA>     pear     apple     pear
4 Los Angeles    New York     apple     <NA>      <NA>     pear
5        <NA>        <NA>      <NA>     <NA>      <NA>     <NA>
6   Milwaukee    New York     apple     pear     apple     <NA>
  

Applying the-mad-statter's solution to alternative data:

library(dplyr)
library(tidyr)
df %>% 
  # Note new variables names <----------------
  rename(
    `Q57/stop` = Q57, 
    `Q60/stop` = Q60
  ) %>% 
  # pivot everything placing values in columns based on suffix
  pivot_longer(
    everything(), 
    names_pattern = "Q\\d+/(.*)", # REGEX WAS EDITED <----------------
    names_to = ".value"
  ) %>% 
  summarize(
    n_stop = n(),
    n_apple = sum(apple == "apple", na.rm = TRUE),
    n_pear = sum(pear == "pear", na.rm = TRUE),
    .by = stop
  ) %>% 
  filter(!is.na(stop)) %>% 
  arrange(stop)

Output:
stop        n_stop n_apple n_pear
Los Angeles      2       2      1
Milwaukee        2       2      2
New York         5       3      4

Solution

  • You should be able to accomplish this with a single pivot_longer().

    Note, your new data made me explicitly realize this approach depends on the columns being in a consistent order and there being the same number of columns in each group to be pivoted. That is, there should be the same number of columns for each fruit as there are destination columns. Column name prefixes and contiguity do not matter; only number, suffix, and order matter. That is, the left-most column for a given fruit will be matched to the left-most destination column. Similarly, the right-most column for a given fruit will be matched to the right-most destination column. Any middle columns will be matched in a similar order.

    Dataset 1

    library(dplyr)
    library(tidyr)
    
    df <- structure(
      list(
        stop1 = c(
          "New York",
          "Milwaukee",
          "New York",
          "Los Angeles",
          NA,
          "Milwaukee"
        ),
        stop2 = c(NA, "New York", "Los Angeles", "New York", NA, "New York"),
        stop1_apple = c("apple", "apple", NA, "apple", NA, "apple"),
        stop1_pear = c("pear", "pear", "pear", NA, NA, "pear"),
        stop2_apple = c(NA, "apple", "apple", NA, NA, "apple"),
        stop2_pear = c(NA, "pear", "pear", "pear", NA, NA)
      ),
      class = "data.frame",
      row.names = c(NA, -6L)
    )
    
    df %>% 
      # add suffixes to stop1, stop2 to provide new name in the pivot
      rename(
        stop1_stop = stop1, 
        stop2_stop = stop2
      ) %>% 
      # pivot everything placing values in columns based on suffix
      pivot_longer(
        everything(), 
        names_pattern = "stop\\d_(.*)", 
        names_to = ".value"
      ) %>% 
      summarize(
        n_stop = n(),
        n_apple = sum(apple == "apple", na.rm = TRUE),
        n_pear = sum(pear == "pear", na.rm = TRUE),
        .by = stop
      ) %>% 
      filter(!is.na(stop)) %>% 
      arrange(stop)
    #> # A tibble: 3 × 4
    #>   stop        n_stop n_apple n_pear
    #>   <chr>        <int>   <int>  <int>
    #> 1 Los Angeles      2       2      1
    #> 2 Milwaukee        2       2      2
    #> 3 New York         5       3      4
    

    Created on 2024-10-09 with reprex v2.1.1

    Dataset 2

    Your edits will work provided the columns are in a consistent order as mentioned above.

    To your second question about counting fruits, we can use across() and count !is.na() instead.

    library(dplyr)
    library(tidyr)
    
    df <- structure(
      list(
        Q57 = c(
          "New York",
          "Milwaukee",
          "New York",
          "Los Angeles",
          NA,
          "Milwaukee"
        ),
        Q60 = c(NA, "New York", "Los Angeles", "New York", NA, "New York"),
        `Q58/apple` = c("apple", "apple", NA, "apple", NA, "apple"),
        `Q58/pear` = c("pear", "pear", "pear", NA, NA, "pear"),
        `Q61/apple` = c(NA, "apple", "apple", NA, NA, "apple"),
        `Q61/pear` = c(NA, "pear", "pear", "pear", NA, NA)
      ),
      class = "data.frame",
      row.names = c(NA, -6L)
    )
    
    df %>% 
      # add suffixes to Q57, Q60 to provide new name in the pivot
      rename(
        `Q57/stop` = Q57, 
        `Q60/stop` = Q60
      ) %>% 
      # pivot everything placing values in columns based on suffix
      pivot_longer(
        everything(), 
        names_pattern = "Q\\d+/(.*)",
        names_to = ".value"
      ) %>% 
      summarize(
        n_stop = n(),
        # instead of counting matches, let us count non-missing
        across(apple:pear, ~ sum(!is.na(.)), .names = "n_{.col}"),
        .by = stop
      ) %>% 
      filter(!is.na(stop)) %>% 
      arrange(stop)
    #> # A tibble: 3 × 4
    #>   stop        n_stop n_apple n_pear
    #>   <chr>        <int>   <int>  <int>
    #> 1 Los Angeles      2       2      1
    #> 2 Milwaukee        2       2      2
    #> 3 New York         5       3      4
    

    Created on 2024-10-09 with reprex v2.1.1