I am looking to summarize a customer transactional dataframe to a single row per customer using dplyr. For continuous variables this is simple - use sum / mean etc. For categorical variables I would like to choose the "Mode" - i.e. the most commonly encountered value within the group and do this across multiple columns e.g.:
For example to take the table Cus1
Cus <- data.frame(Customer = c("C-01", "C-01", "C-02", "C-02", "C-02", "C-02", "C-03", "C-03"),
Product = c("COKE", "COKE", "FRIES", "SHAKE", "BURGER", "BURGER", "CHICKEN", "FISH"),
Store = c("NYC", "NYC", "Chicago", "Chicago", "Detroit", "Detroit", "LA", "San Fran")
)
And generate the table Cus_Summary:
Cus_Summary <- data.frame(Customer = c("C-01", "C-02", "C-03"),
Product = c("COKE", "BURGER", "CHICKEN"),
Store = c("NYC", "Chicago", "LA")
)
Are there any packages that can provide this function? Or has anyone a function that can be applied across multiple columns within a dplyr step?
I am not worried about smart ways to handle ties - any output for a tie will suffice (although any suggestions as to how to best handle ties would be interesting and appreciated).
How about this?
Cus %>%
group_by(Customer) %>%
summarise(
Product = first(names(sort(table(Product), decreasing = TRUE))),
Store = first(names(sort(table(Store), decreasing = TRUE))))
## A tibble: 3 x 3
# Customer Product Store
# <fct> <chr> <chr>
#1 C-01 COKE NYC
#2 C-02 BURGER Chicago
#3 C-03 CHICKEN LA
Note that in the case of ties this selects the first entry in alphabetical order.
To randomly select an entry from tied top frequency entries we could define a custom function
top_random <- function(x) {
tbl <- sort(table(x), decreasing = T)
top <- tbl[tbl == max(tbl)]
return(sample(names(top), 1))
}
Then the following randomly selects one of the tied top entries:
Cus %>%
group_by(Customer) %>%
summarise(
Product = top_random(Product),
Store = top_random(Store))