rdplyrstrsplitvocabulary

Dictionary-like matching on string in R


I have a dataframe in which a string variable is an informal list of elements, that can be split on a symbol. I would like to make operaion on these elements on the basis of another dataset.

e.g. task: Calculate the sum of the elements

df_1 <- data.frame(element=c(1:2),groups=c("A,B,C","A,D"))
df_2 <- data.frame(groups=c("A","B","C","D"), values=c(1:4))

desired <- data.frame(element=c(1:2),groups=c("A,B,C","A,D"),sum=c(6,5))

Solution

  • An option would be to split the 'groups' by the delimiter , to expand the rows with separate_rows, do a join with the key/val dataset ('df_2'), groued by 'element', get the sum of 'values'

    library(tidyverse)
    df_1 %>% 
      separate_rows(groups) %>% 
      left_join(df_2) %>% 
      group_by(element) %>%
      summarise(groups = toString(groups), sum = sum(values))
    # A tibble: 2 x 3
    #  element groups    sum
    #    <int> <chr>   <int>
    #1       1 A, B, C     6
    #2       2 A, D        5
    

    Or another option with base R would be to use a named key/value vector 'nm1') to change the values in the splitted list elements, sum and assign it to new column in 'df_1'

    nm1 <- setNames(df_2$values, df_2$groups)
    df_1$sum <- sapply(strsplit(as.character(df_1$groups), ","), function(x) sum(nm1[x]))