I'm currently working on an impala db and I'm having some problems with dbplyr's SQL translation.
This is the first iteration of my code which works in R if I collect the table beforehand (which is not something I want to do as it takes forever):
DF2_V1 <- DF1 %>%
filter(indicator != "N") %>%
group_by(id) %>%
filter(!("Y" %in% indicator) | (indicator == "Y"),
!("ANALYSIS" %in% indicator) | (indicator != "RECOMMENDED")) %>%
filter(time1 == min(time1)) %>%
ungroup() %>%
mutate(time_diff = time1 - time2) %>%
select(id,indicator,time1,time2,time_diff %>% show_query() %>% collect()
Essentially the goal of this code is to take DF1 ;
ID | INDICATOR | TIME1 | TIME2 |
---|---|---|---|
1 | Y | ... | ..... |
1 | N | ... | ..... |
1 | RECOMMEND | ... | ..... |
2 | RECOMMEND | ... | ..... |
2 | ANALYSIS | ... | ..... |
And perform the following logic: If a given id has any Y indicator remove the others and keep the earliest iteration of that indicator. If Y is not present, we favor the indicator ANALYSIS instead and take the first iteration (earliest time), if not we take RECOMMENDED. This code works fine in R (when collecting DF1 beforehand) and does what I want however when the DF1 table is uncollected and we're performing a SQL query I get the following error:
Error in new_result(connection@ptr, statement, immediate) :
nanodbc/nanodbc.cpp:1412: 00000: [RStudio][ImpalaODBC] (360)
Syntax error occurred during query execution: [HY000] :
AnalysisException: Syntax error in line 32:WHEN ('Y' IN indicator) THEN 'Y'
^
Encountered: IDENTIFIER
Expected: (CAUSED BY: Exception: Syntax error
I'm still quite new to db queries and I was not sure what to make of this so I tried rewriting the code in R using SQL script in dbplyr which made some minor modifications hoping to clarify my logic:
DF2_V2 <- DF1 %>%
filter(indicator != "NULL") %>%
group_by(id) %>%
mutate(indicator = case_when(
sql("'Y' IN indicator") ~ "Y",
sql("('ANALYSIS' IN indicator) AND (indicator != 'RECOMMENDED')") ~ "ANALYSIS",
TRUE ~ "RECOMMENDED")) %>%
filter(time1 == min(time1)) %>%
mutate(time_diff = time1 - time2) %>% select(...) %>% collect()
This presented with the same error. I also tried my queries directly in the db using the show_query translation to see if it was a problem with R's connection but inevitably came to the same conclusion. Not sure if my code itself is faulty or the translation into SQL is being messed up but I cant seem to find the problem.
I don't know if it's a bug in dbplyr
, but forcing the parens should work:
DF2_V1 <- DF1 %>%
filter(indicator != "N") %>%
group_by(id) %>%
filter(!("Y" %in% (indicator)) | (indicator == "Y"),
!("ANALYSIS" %in% (indicator)) | (indicator != "RECOMMENDED")) %>%
filter(time1 == min(time1)) %>%
ungroup() %>%
mutate(time_diff = time1 - time2) %>%
select(id,indicator,time1,time2,time_diff %>% show_query() %>% collect()
(This might be specific to the impala backend/driver, not sure.)