I have a series of dataframes, each of which contains a name column and then a text column. I'd like to find duplicates in the text, and then generate a list of all the names that are associated with the duplicate. I can get as far as getting a list of the text duplicates and the number of times each duplicate occurs, but I'm struggling to find a way to get the list of associated names. Here is a reproducible example:
library(dplyr)

# two separate data frames with name/string
books1 <- data.frame(
  name = rep("Ellie", 4),
  book = c("Anne of Green Gables", "The Secret Garden", "Alice in Wonderland", "A Little Princess"))
books2 <- data.frame(
  name = rep("Jess", 6),
  book = c("Harry Potter", "Percy Jackson", "Anne of Green Gables", "Chronicles of Narnia", "Redwall", "A Little Princess"))

# combine into single data frame
books <- bind_rows(books1, books2)

# identify repeats
repeatbooks <- books %>% group_by(book) %>% summarize(n = n())
This gives me:
  book                 n
1 A Little Princess    2
2 Alice in Wonderland  1
3 Anne of Green Gables 2
4 Chronicles of Narnia 1
5 Harry Potter         1
6 Percy Jackson        1
7 Redwall              1
8 The Secret Garden    1
What I'd like is something like:
  book                 n name
1 A Little Princess    2 Ellie, Jess
2 Alice in Wonderland  1 Ellie
3 Anne of Green Gables 2 Ellie, Jess
I'd hoped to do something like this, but it creates multiple rows per book rather than collecting the names into a single row:
#identify repeats while catching associated names - doesn't group into single column
repeatbooks <- books %>% group_by(book) %>% summarize(n=n(), names=c(paste0(name), ', '))
Do you mean something like the following? (Note that reframe() and the .by argument need dplyr 1.1.0 or later.)
books %>%
  reframe(
    n = n(),
    name = toString(unique(name)),
    .by = book
  )
which gives
                  book n        name
1 Anne of Green Gables 2 Ellie, Jess
2    The Secret Garden 1       Ellie
3  Alice in Wonderland 1       Ellie
4    A Little Princess 2 Ellie, Jess
5         Harry Potter 1        Jess
6        Percy Jackson 1        Jess
7 Chronicles of Narnia 1        Jess
8              Redwall 1        Jess
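
If you only want the books that really are duplicated, or you are on an older dplyr without reframe(), a group_by() + summarise() version should work too. This is a minimal sketch, not the only way to do it: the name dupes and the n > 1 cutoff are my own assumptions about what counts as a duplicate. Your attempt returned several rows because c(paste0(name), ', ') is still a vector with more than one element per group; collapsing the names into a single string (paste(..., collapse = ", "), which is essentially what toString() does) leaves one row per book.

```r
library(dplyr)

# Rebuild the example data from the question so the snippet runs on its own
books <- bind_rows(
  data.frame(name = rep("Ellie", 4),
             book = c("Anne of Green Gables", "The Secret Garden",
                      "Alice in Wonderland", "A Little Princess")),
  data.frame(name = rep("Jess", 6),
             book = c("Harry Potter", "Percy Jackson", "Anne of Green Gables",
                      "Chronicles of Narnia", "Redwall", "A Little Princess"))
)

# `dupes` is a hypothetical name; n > 1 is an assumed definition of "duplicate"
dupes <- books %>%
  group_by(book) %>%
  summarise(
    n = n(),
    # collapse the distinct names into one string so each book stays on one row
    name = paste(unique(name), collapse = ", "),
    .groups = "drop"
  ) %>%
  filter(n > 1)

print(dupes)
```

With the example data this should leave only the two books that both Ellie and Jess have. Drop the filter() step if you want the full table with counts of 1 as well.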