rtreefilteringgrepl

Filter a dataframe based on a selection of string values found across multiple columns


I have a huge database on replanting projects using different species of trees, and I want to create a new database selecting only the species I am interested in. I have ~70 words (i.e. species) I want to select from the dataframe, across 3 different columns. I'm trying to use the 'grepl' function, but I'm lost on adding multiple columns with the same selection of key words. The words/species can occur inconjunction with other species not targeted by my 70 words, not sure if that is the issue.

Essentially, I am trying to build code that finds any instance of the 70 words across the dataset, and selects them (or alternatively removes any row that does not include any of those 70), in order to avoid using command-f for 70+ words across a grand total of 16 datasets with thousands of rows.

Any help is much appreciated.

First I tried filtering the dataset with the 'grepl' function on the first column, called 'species' for the ~70 words, however it printed rows that do not include the 70 words/species. This is the output of the following:

> dput(head(NCR[,c("REGION", "COMPONENT","SPECIES")]))
structure(list(REGION = c("NCR", "NCR", "NCR", "NCR", "NCR", 
"NCR"), COMPONENT = c("Urban", "Urban", "Urban", "Urban", "Urban", 
"Urban"), SPECIES = c("Narra", "Banaba, Caballero, Ilang ilang, Molave, Yellow alder,Bougainvilla,", 
"Bignay, Camachile, Nangka, Sampaloc, Santol,Narra,kalumpit,langka,lipote,guyabano,palawan cherry,banaba,mahogany,Golden\nshower,Mangqa,Bayabas,bignay,molave", 
"Sansevieria, Spider lily, Yellow morado, Zigzag, Sansevieria, Spider lily, Yellow morado, Zigzag\nSansevieria, Spider lily, Yellow morado, Zigzag", 
"Banaba, Caballero, Ilang ilang, Narra, Tuai,", "Acacia, Acapulco, Antipolo, Bagras, Balete, Bougainvilla, Dao, Fire tree, Golden shower, Ipil, Kalumpit, Kamagong, Lipote, Manila palm, Molave, Nangka, Neem tree, Supa, Tuai, Yakal,mabolo,tabebuia,langka,bitaog,narracamachile,banaba,ilang\nilang,guyabano"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
key_terms <- c('mangrove','magrove','avicennia','bungalon','api-api','piapi','piape','miapi','myapi','miape','Rhizophora','bakau','Bakauan', 'bakaw','bakhaw','bacau','bacaw','Sonneratia','pagatpat','pedada','Nypa','nipa','nypa','sasa','Bruguiera','pototan','busain','langarai','Camptostemon','gapas','Ceriops','baras','tungog','tangal','Excoecaria','lipata','buta','Heritiera','dungon','Aegiceras','saging','Lumnitziera','tubao','culasi','kulasi','Osbornia','tawalis','bunot','Pemphis','bantigi','Scyphiphora','nilad','Xylocarpus','tabigi','tabige','piagao','piag-ao','tubo tubo','tubo-tubo','saging-saging','moluccensis','granatum','hydrophyllaceae','adicula','octodonta','corniculatum','littoralis','agallocha','tagal','decandra','philippinensis','parviflora','fruticans','caseolaris','ovata','alba' )
new_NCR <- filter(NCR, grepl(paste(key_terms, collapse='|'), SPECIES))
new_NCR

Solution

  • You should be able to use dplyr::if_any within your dplyr::filter() here.

    You didn't have any of the values in key_terms in you sample data, so 0 rows were returned. I tweaked the key_terms to include "Narra", which is found in a few rows

    key_terms <- c('mangrove', 'alba', 'Narra')
    
    filter(NCR, if_any(REGION:SPECIES, 
                       ~grepl(paste(key_terms, collapse='|'), .x)))
    

    Output:

    #1 NCR    Urban     "Narra"                                                                                                                                             
    #2 NCR    Urban     "Bignay, Camachile, Nangka, Sampaloc, #Santol,Narra,kalumpit,langka,lipote,guyabano,palawan #cherry,banaba,mahogany,Golden\nshower,Mangqa,Bayabas,big…
    #3 NCR    Urban     "Banaba, Caballero, Ilang ilang, Narra, Tuai,"