I am trying to detect whether certain combinations of patterns are present or absent in one variable of a data frame.
There are some questions that are similar, but I could not find one that answers exactly what I am trying to achieve.
I am trying to find:
I still cannot find a fix, but I will share what I have tried so far to get your guidance:
Create a sample data frame:
x=structure(list(Sources = structure(c(1L, 7L, 6L, 8L, 9L, 4L,
3L, 5L, 2L), .Label =
c("Found in all nutritious foods in moderate amounts: pork, whole grain foods or enriched breads and cereals, legumes, nuts and seeds",
"Found only in fruits and vegetables, especially citrus fruits, vegetables in the cabbage family, cantaloupe, strawberries, peppers, tomatoes, potatoes, lettuce, papayas, mangoes, kiwifruit",
"Leafy green vegetables and legumes, seeds, orange juice, and liver; now added to most refined grains",
"Meat, fish, poultry, vegetables, fruits",
"Meat, poultry, fish, seafood, eggs, milk and milk products; not found in plant foods",
"Meat, poultry, fish, whole grain foods, enriched breads and cereals, vegetables (especially mushrooms, asparagus, and leafy green vegetables), peanut butter",
"Milk and milk products; leafy green vegetables; whole grain foods, enriched breads and cereals",
"Widespread in foods", "Widespread in foods; also produced in intestinal tract by bacteria"
), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))
This code detects the presence of either of the two specified strings; (?i) means ignore case.
library(stringr)
x$present = str_detect(x$Sources, "(?i)Vegetables|(?i)Meat")
# but it does not work with "and"
x$present = str_detect(x$Sources, "(?i)Vegetables&(?i)Meat")
# here it gives FALSE for all rows; my expected output is TRUE for the rows that contain both words
This one works by filtering the desired combination:
x %>% filter(str_detect(Sources, "(?i)Vegetables") & str_detect(Sources, "(?i)Meat"))
x %>% filter(str_detect(Sources, "(?i)Vegetables") & !str_detect(Sources, "(?i)Meat")) # does not contain meat
x %>% filter(!str_detect(Sources, "(?i)Meat") & str_detect(Sources, "(?i)Vegetables") & str_detect(Sources, "(?i)Grain"))
Finally, I found this package, which looks like it can do the job, but it only seems to work with vectors. Is there a way to make it work for a variable in a data frame, for example with lapply, so that it returns another variable with TRUE/FALSE?
library(sjmisc)
str_contains(x$Sources, "Meat", ignore.case = TRUE)
Use mutate with str_detect to create the new column:
library(tidyverse)
x %>%
  mutate(pattern_detected =
           str_detect(Sources, "(?i)Vegetables") &
           str_detect(Sources, "(?i)Meat"))
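
The same idea should extend to the "absent" combinations from the question by negating individual str_detect calls with !. Below is a minimal sketch, not a definitive implementation. It assumes a small stand-in data frame (foods) rather than the original x, and a hypothetical helper has() that wraps str_detect with regex(ignore_case = TRUE). The last column shows a single-pattern alternative using two lookaheads, which stringr's regex engine supports.

```r
library(dplyr)
library(stringr)

# Stand-in for the Sources column (hypothetical data, not the original x)
foods <- data.frame(
  Sources = c(
    "Meat, fish, poultry, vegetables, fruits",
    "Milk and milk products; leafy green vegetables; whole grain foods",
    "Widespread in foods"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: case-insensitive detection of one word
has <- function(string, word) {
  str_detect(string, regex(word, ignore_case = TRUE))
}

result <- foods %>%
  mutate(
    # both words present
    veg_and_meat = has(Sources, "vegetables") & has(Sources, "meat"),
    # vegetables present, meat absent
    veg_no_meat = has(Sources, "vegetables") & !has(Sources, "meat"),
    # vegetables and grain present, meat absent
    veg_grain_no_meat = has(Sources, "vegetables") &
      has(Sources, "grain") &
      !has(Sources, "meat"),
    # single regex: both lookaheads must succeed for a match
    both_lookahead = str_detect(Sources, "(?i)(?=.*vegetables)(?=.*meat)")
  )

print(result)
```

Combining separate str_detect calls with & and ! is usually easier to read and change than one long lookahead pattern, so the lookahead form is mainly useful when a single pattern string is required.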