df <- structure(list(stop1 = c("New York", "Milwaukee", "New York",
"Los Angeles", NA, "Milwaukee"),
stop2 = c(NA, "New York", "Los Angeles",
"New York", NA, "New York"),
stop1_apple = c("apple", "apple", NA, "apple", NA, "apple"),
stop1_pear = c("pear", "pear", "pear", NA, NA, "pear"),
stop2_apple = c(NA, "apple", "apple", NA, NA, "apple"),
stop2_pear = c(NA, "pear", "pear", "pear", NA, NA)),
class = "data.frame", row.names = c(NA, -6L))
df
stop1 stop2 stop1_apple stop1_pear stop2_apple stop2_pear
1 New York <NA> apple pear <NA> <NA>
2 Milwaukee New York apple pear apple pear
3 New York Los Angeles <NA> pear apple pear
4 Los Angeles New York apple <NA> <NA> pear
5 <NA> <NA> <NA> <NA> <NA> <NA>
6 Milwaukee New York apple pear apple <NA>
Each row is a traveler. For example, the third row is a traveler who stopped in New York and Los Angeles. During the first stop in New York, she ate a pear. During the second stop in Los Angeles, she ate an apple and a pear.
Firstly, I want to compute the total number of people who transited in each city. For example, here, 5 people transited in New York (2 at stop1, 3 at stop2) in total. Secondly, I want to compute the number of apples and pears eaten in each city. For example, in New York, 3 apples and 4 pears were eaten.
df
Location NStops NApples NPears
Los Angeles 2 2 1
Milwaukee 2 2 2
New York 5 3 4
library(tidyverse)
df %>% pivot_longer(c(stop1, stop2)) %>%
rename(Location = value, Stop = name) %>%
add_count(Location, name = "NStops") %>%
pivot_longer(c(stop1_apple, stop1_pear, stop2_apple, stop2_pear)) %>%
group_by(Location, value, NStops) %>%
summarise(NFruits = n()) %>%
pivot_wider(names_from = value, values_from = NFruits) %>%
rename(NApples = apple, NPears = pear) %>%
select(-`NA`) %>%
filter(!is.na(Location))
Output:
Location NStops NApples NPears
Los Angeles 2 2 3
Milwaukee 2 4 3
New York 5 7 7
The number of stops in each city is correct, but the number of fruits eaten is not as expected, see desired output above. Also note, in reality, I have more than 2 stops, more than 2 fruits, and dozens of cities. Finally, if possible, I would prefer a tidyverse
solution.
df <- structure(list(Q57 = c("New York", "Milwaukee", "New York",
"Los Angeles", NA, "Milwaukee"),
Q60 = c(NA, "New York", "Los Angeles",
"New York", NA, "New York"),
`Q58/apple` = c("apple", "apple", NA, "apple", NA, "apple"),
`Q58/pear` = c("pear", "pear", "pear", NA, NA, "pear"),
`Q61/apple` = c(NA, "apple", "apple", NA, NA, "apple"),
`Q61/pear` = c(NA, "pear", "pear", "pear", NA, NA)),
class = "data.frame", row.names = c(NA, -6L))
df
Q57 Q60 Q58/apple Q58/pear Q61/apple Q61/pear
1 New York <NA> apple pear <NA> <NA>
2 Milwaukee New York apple pear apple pear
3 New York Los Angeles <NA> pear apple pear
4 Los Angeles New York apple <NA> <NA> pear
5 <NA> <NA> <NA> <NA> <NA> <NA>
6 Milwaukee New York apple pear apple <NA>
library(dplyr)
library(tidyr)
df %>%
# Note new variables names <----------------
rename(
`Q57/stop` = Q57,
`Q60/stop` = Q60
) %>%
# pivot everything placing values in columns based on suffix
pivot_longer(
everything(),
names_pattern = "Q\\d+/(.*)", # REGEX WAS EDITED <----------------
names_to = ".value"
) %>%
summarize(
n_stop = n(),
n_apple = sum(apple == "apple", na.rm = TRUE),
n_pear = sum(pear == "pear", na.rm = TRUE),
.by = stop
) %>%
filter(!is.na(stop)) %>%
arrange(stop)
Output:
stop n_stop n_apple n_pear
Los Angeles 2 2 1
Milwaukee 2 2 2
New York 5 3 4
You should be able to accomplish this with a single pivot_longer()
.
Note, your new data made me explicitly realize this approach depends on the columns being in a consistent order and there being the same number of columns in each group to be pivoted. That is, there should be the same number of columns for each fruit as there are destination columns. Column name prefixes and contiguity do not matter; only number, suffix, and order matter. That is, the left-most column for a given fruit will be matched to the left-most destination column. Similarly, the right-most column for a given fruit will be matched to the right-most destination column. Any middle columns will be matched in a similar order.
library(dplyr)
library(tidyr)
df <- structure(
list(
stop1 = c(
"New York",
"Milwaukee",
"New York",
"Los Angeles",
NA,
"Milwaukee"
),
stop2 = c(NA, "New York", "Los Angeles", "New York", NA, "New York"),
stop1_apple = c("apple", "apple", NA, "apple", NA, "apple"),
stop1_pear = c("pear", "pear", "pear", NA, NA, "pear"),
stop2_apple = c(NA, "apple", "apple", NA, NA, "apple"),
stop2_pear = c(NA, "pear", "pear", "pear", NA, NA)
),
class = "data.frame",
row.names = c(NA, -6L)
)
df %>%
# add suffixes to stop1, stop2 to provide new name in the pivot
rename(
stop1_stop = stop1,
stop2_stop = stop2
) %>%
# pivot everything placing values in columns based on suffix
pivot_longer(
everything(),
names_pattern = "stop\\d_(.*)",
names_to = ".value"
) %>%
summarize(
n_stop = n(),
n_apple = sum(apple == "apple", na.rm = TRUE),
n_pear = sum(pear == "pear", na.rm = TRUE),
.by = stop
) %>%
filter(!is.na(stop)) %>%
arrange(stop)
#> # A tibble: 3 × 4
#> stop n_stop n_apple n_pear
#> <chr> <int> <int> <int>
#> 1 Los Angeles 2 2 1
#> 2 Milwaukee 2 2 2
#> 3 New York 5 3 4
Created on 2024-10-09 with reprex v2.1.1
Your edits will work provided the columns are in a consistent order as mentioned above.
To your second question about counting fruits, we can use across()
and count !is.na()
instead.
library(dplyr)
library(tidyr)
df <- structure(
list(
Q57 = c(
"New York",
"Milwaukee",
"New York",
"Los Angeles",
NA,
"Milwaukee"
),
Q60 = c(NA, "New York", "Los Angeles", "New York", NA, "New York"),
`Q58/apple` = c("apple", "apple", NA, "apple", NA, "apple"),
`Q58/pear` = c("pear", "pear", "pear", NA, NA, "pear"),
`Q61/apple` = c(NA, "apple", "apple", NA, NA, "apple"),
`Q61/pear` = c(NA, "pear", "pear", "pear", NA, NA)
),
class = "data.frame",
row.names = c(NA, -6L)
)
df %>%
# add suffixes to Q57, Q60 to provide new name in the pivot
rename(
`Q57/stop` = Q57,
`Q60/stop` = Q60
) %>%
# pivot everything placing values in columns based on suffix
pivot_longer(
everything(),
names_pattern = "Q\\d+/(.*)",
names_to = ".value"
) %>%
summarize(
n_stop = n(),
# instead of counting matches, let us count non-missing
across(apple:pear, ~ sum(!is.na(.)), .names = "n_{.col}"),
.by = stop
) %>%
filter(!is.na(stop)) %>%
arrange(stop)
#> # A tibble: 3 × 4
#> stop n_stop n_apple n_pear
#> <chr> <int> <int> <int>
#> 1 Los Angeles 2 2 1
#> 2 Milwaukee 2 2 2
#> 3 New York 5 3 4
Created on 2024-10-09 with reprex v2.1.1