Tags: r, matching, stringr, grepl, fuzzy

How to fuzzy match two character vectors in R


Context

I have a data frame df, where id identifies a person and fruits_eat lists the fruits that person eats. I also have a vector fruits_list storing a list of fruits.

Question

I want to generate a new variable fruits_in_list that indicates whether a person ate one or more of the fruits in fruits_list, but I don't know how to implement it in R.

What I've done

I checked some existing answers, but none of them quite fits my problem, for example:

  1. R Match character vectors
  2. Compare two character vectors in R
  3. https://stackoverflow.com/search?q=How+to+fuzzy+match+two+character+vectors
  4. How to run through list of keyword vectors and fuzzy match them to a different file (R)
  5. Matching strings with abbreviations; fuzzy matching

Reproducible code

fruits_Jack = c('XXappleYYY,lemon,orange,pitaya')
fruits_Rose = c('Navel orange,Blood orange,watermelon,cherry')
fruits_Biden= c('pitaya,cherry,banana')

fruits_list = c('apple', 'lemon', 'orange', 'watermelon', 'peach', 'pear')

df = 
  data.frame(id         = c('Jack', 'Rose', 'Biden'),
             fruits_eat = c(fruits_Jack, fruits_Rose, fruits_Biden))

> df
     id                                  fruits_eat
1  Jack              XXappleYYY,lemon,orange,pitaya
2  Rose Navel orange,Blood orange,watermelon,cherry
3 Biden                        pitaya,cherry,banana


Expected output

df_expect = cbind(df, fruits_in_list = c(1, 1, 0))

> df_expect
     id                                  fruits_eat fruits_in_list
1  Jack              XXappleYYY,lemon,orange,pitaya              1
2  Rose Navel orange,Blood orange,watermelon,cherry              1
3 Biden                        pitaya,cherry,banana              0

Solution

  • With stringr, use str_detect for a 0/1 flag, or str_count if you want the number of matches:

    library(stringr)
    library(dplyr)

    # Regex alternation: "apple|lemon|orange|watermelon|peach|pear"
    pattern <- paste(fruits_list, collapse = "|")

    df %>%
      mutate(fruits_in_list = +str_detect(fruits_eat, pattern),  # logical -> 0/1
             count = str_count(fruits_eat, pattern))             # number of matches
    
         id                                  fruits_eat fruits_in_list count
    1  Jack              XXappleYYY,lemon,orange,pitaya              1     3
    2  Rose Navel orange,Blood orange,watermelon,cherry              1     3
    3 Biden                        pitaya,cherry,banana              0     0
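
The same idea can probably be written in base R with grepl, without loading any packages. The sketch below rebuilds the question's data so it runs on its own. The column name n_distinct is only an illustrative choice, not something from the question. Note that str_count counts every match, so Rose gets 3 ("orange" twice plus "watermelon"); the variant here counts how many different entries of fruits_list appear, assuming that is the number you are after.

```r
fruits_list <- c("apple", "lemon", "orange", "watermelon", "peach", "pear")

df <- data.frame(
  id         = c("Jack", "Rose", "Biden"),
  fruits_eat = c("XXappleYYY,lemon,orange,pitaya",
                 "Navel orange,Blood orange,watermelon,cherry",
                 "pitaya,cherry,banana")
)

# Same regex alternation as in the stringr version
pattern <- paste(fruits_list, collapse = "|")

# grepl() returns TRUE/FALSE per row; as.integer() turns that into 1/0
df$fruits_in_list <- as.integer(grepl(pattern, df$fruits_eat))

# n_distinct (hypothetical column name): for each row, test every fruit
# as a fixed substring and count how many different fruits were found
df$n_distinct <- vapply(
  df$fruits_eat,
  function(x) {
    sum(vapply(fruits_list,
               function(f) grepl(f, x, fixed = TRUE),
               logical(1)))
  },
  integer(1),
  USE.NAMES = FALSE
)

df$fruits_in_list  # 1 1 0
df$n_distinct      # 3 2 0
```

Both versions do substring matching, which is why "XXappleYYY" counts as "apple". If a fruit name could ever contain regex metacharacters, the fixed = TRUE form used for n_distinct is the safer of the two.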