sqlrregexdplyrimpala

function to replace R's grepl in SQL


Little bit of context: I'm working with an impala database and am currently querying a large table with a particular column (i'll call it body) which essentially has paragraphs of text inside it. Because of the nature of this query, I need to embed a text miner (using regex) onto that body column which will give off certain indicators if something is found. Using R, I can do this just fine with a grepl or a simple stri_detect_regex function of the stringi package, the caveat is it forces me to collect my table beforehand since these substring detecting functions are not translatable directly into sql.

What I'm looking for is a replacement function for R's grepl which will spit out true/false indicators to work with my code below (which does what I need) such that I can keep the query uncollected to optimize running time:

DF_2 <- DF_1 %>% 
  collect() %>%
  mutate(body = tolower(body),
         indicator= ifelse(stri_detect_regex(body,"REGEX1"),"Y",
                        ifelse(stri_detect_regex(body,'REGEX2'),"ANALYSIS",
                               ifelse(stri_detect_regex(body,"REGEX3"),"RECOMMENDED","NULL")))) %>% select(...)

Here is an example DF just to get some context:

DF1

body
text1
text2
text3

Desirable output:

body indicator
text1 Y
text2 NULL
text3 ANALYSIS

It's just a very simple nested ifelse statement that starts checking for the first regex match if so then prints Y, if not then checks for the second regex match and so on...


Here is my attempted adaptation (which did not do what I wanted):

DF_2 <- DF_1 %>%
  mutate(body = tolower(body),
         indicator= ifelse(sql("body RLIKE 'REGEX1'"),"Y",
                                     ifelse(sql("body RLIKE 'REGEX2'"),"ANALYSIS",
                                            ifelse(sql("body RLIKE 'REGEX3'"),"RECOMMENDED","NULL")))) %>% select(...)

For some reason the output column (indicator) came out to be a column with variable type boolean which has me believe the function I tried using to replace grepl is not doing the same task as the grepl function was. Is there a fix to this or maybe a more appropriate function to use that would accomplish what I need?


Solution

  • With @Gregor Thomas's comment I was able to modify my code that basically does what I needed:

    DF2 <- DF1 %>%
      mutate(body = tolower(body),
             indicator = case_when(sql("lower(body) RLIKE 'regex1'") ~ "NULL",
                                            sql("lower(body) RLIKE 'regex2'") ~ "Y",
                                            sql("lower(body) RLIKE 'regex3'") ~ "ANALYSIS",
                                            sql("lower(body) RLIKE 'regex4'") ~ "RECOMMENDED",
                                            TRUE ~ "NULL")) %>% select(...)
    

    for some reason it didnt treat the first mutate I did to the body column so I had to repeat it when doing the RLIKE but this is exactly what I was hoping for (a query that doesnt have to force collection beforehand)