Assuming a data frame df
with several columns, of which a "description" field,
and assuming a set of keywords stored within a separate vector keywords
, what is the best practice to:
df
for each of these keywords, named accordingly ;df
?e.g.
(df <- data.frame(
ID = letters[1:10],
DESCRIPTION = c("blue", "red", "this was green", "this was red", "blue and red", "green", NA, "green", "green, blue, and red", NA)
))
ID DESCRIPTION
1 a blue
2 b red
3 c this was green
4 d this was red
5 e blue and red
6 f green
7 g <NA>
8 h green
9 i green, blue, and red
10 j <NA>
keywords <- c("blue", "red", "green")
would return
ID DESCRIPTION blue red green
1 a blue 1 0 0
2 b red 0 1 0
3 c this was green 0 0 1
4 d this was red 0 1 0
5 e blue and red 1 1 0
6 f green 0 0 1
7 g <NA> 0 0 0
8 h green 0 0 1
9 i green, blue, and red 1 1 1
10 j <NA> 0 0 0
Prefereable using base
R or dplyr
(eg. avoiding data.table
).
Note: the answer needs to be scalable (many possible keywords).
An approach using unnest
, str_count
and pivot_wider
, assuming all occurrences of the keywords within all strings per row have to be counted. Using a slightly modified data set to show multiple counts
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(nms = list(!!keywords)) %>%
unnest(nms) %>%
rowwise() %>%
mutate(values = str_count(DESCRIPTION, nms),
values = replace(values, is.na(values), 0)) %>%
ungroup() %>%
pivot_wider(names_from=nms, values_from=values)
# A tibble: 10 × 5
ID DESCRIPTION blue red green
<chr> <chr> <dbl> <dbl> <dbl>
1 a blue blue 2 0 0
2 b red 0 1 0
3 c this was green 0 0 1
4 d this was red 0 1 0
5 e blue and red 1 1 0
6 f green 0 0 1
7 g NA 0 0 0
8 h green 0 0 1
9 i green, blue, and red 1 1 1
10 j NA 0 0 0
or with base R sapply
and str_count
library(stringr)
cbind(df, sapply(keywords, function(x){
res <- str_count(df$DESCRIPTION, x)
replace(res, is.na(res), 0)}))
ID DESCRIPTION blue red green
1 a blue blue 2 0 0
2 b red 0 1 0
3 c this was green 0 0 1
4 d this was red 0 1 0
5 e blue and red 1 1 0
6 f green 0 0 1
7 g <NA> 0 0 0
8 h green 0 0 1
9 i green, blue, and red 1 1 1
10 j <NA> 0 0 0
df <- structure(list(ID = c("a", "b", "c", "d", "e", "f", "g", "h",
"i", "j"), DESCRIPTION = c("blue blue", "red", "this was green",
"this was red", "blue and red", "green", NA, "green", "green, blue, and red",
NA)), row.names = c(NA, -10L), class = "data.frame")