sqlrdplyrimpaladbplyr

Impala Query(dbplyr) Error : Encountered Identifier : expected: (


I'm currently working on an impala db and I'm having some problems with dbplyr's SQL translation.

This is the first iteration of my code which works in R if I collect the table beforehand (which is not something I want to do as it takes forever):

DF2_V1 <- DF1 %>%
  filter(indicator != "N") %>%
  group_by(id) %>%
  filter(!("Y" %in% indicator) | (indicator == "Y"),
         !("ANALYSIS" %in% indicator) | (indicator != "RECOMMENDED")) %>%
  filter(time1 == min(time1)) %>%
  ungroup() %>%
  mutate(time_diff = time1 - time2) %>%
  select(id,indicator,time1,time2,time_diff %>% show_query() %>% collect()

Essentially the goal of this code is to take DF1 ;

ID INDICATOR TIME1 TIME2
1 Y ... .....
1 N ... .....
1 RECOMMEND ... .....
2 RECOMMEND ... .....
2 ANALYSIS ... .....

And perform the following logic: If a given id has any Y indicator remove the others and keep the earliest iteration of that indicator. If Y is not present, we favor the indicator ANALYSIS instead and take the first iteration (earliest time), if not we take RECOMMENDED. This code works fine in R (when collecting DF1 beforehand) and does what I want however when the DF1 table is uncollected and we're performing a SQL query I get the following error:

Error in new_result(connection@ptr, statement, immediate) :
nanodbc/nanodbc.cpp:1412: 00000: [RStudio][ImpalaODBC] (360)
Syntax error occurred during query execution: [HY000] :
AnalysisException: Syntax error in line 32:

WHEN ('Y' IN indicator) THEN 'Y'
^
Encountered: IDENTIFIER
Expected: (

CAUSED BY: Exception: Syntax error

I'm still quite new to db queries and I was not sure what to make of this so I tried rewriting the code in R using SQL script in dbplyr which made some minor modifications hoping to clarify my logic:

DF2_V2 <- DF1 %>%
  filter(indicator != "NULL") %>%
  group_by(id) %>%
  mutate(indicator = case_when(
    sql("'Y' IN indicator") ~ "Y",
    sql("('ANALYSIS' IN indicator) AND (indicator != 'RECOMMENDED')") ~ "ANALYSIS",
    TRUE ~ "RECOMMENDED")) %>%
  filter(time1 == min(time1)) %>%
  mutate(time_diff = time1 - time2) %>% select(...) %>% collect()

This presented with the same error. I also tried my queries directly in the db using the show_query translation to see if it was a problem with R's connection but inevitably came to the same conclusion. Not sure if my code itself is faulty or the translation into SQL is being messed up but I cant seem to find the problem.


Solution

  • I don't know if it's a bug in dbplyr, but forcing the parens should work:

    DF2_V1 <- DF1 %>%
      filter(indicator != "N") %>%
      group_by(id) %>%
      filter(!("Y" %in% (indicator)) | (indicator == "Y"),
             !("ANALYSIS" %in% (indicator)) | (indicator != "RECOMMENDED")) %>%
      filter(time1 == min(time1)) %>%
      ungroup() %>%
      mutate(time_diff = time1 - time2) %>%
      select(id,indicator,time1,time2,time_diff %>% show_query() %>% collect()
    

    (This might be specific to the impala backend/driver, not sure.)