I would like to extract the last part of the string (after the last forward slash). When I use the following code it fails with the error:
library(sparklyr)
library(tidyverse)
sc <- spark_connect(method = "databricks")
tibble(my_string = c("aaa/bbb/ccc", "ddd/eee/fff", "ggg/hhh/iii")) %>%
copy_to(sc, ., "test_regexp_extract", overwrite = TRUE) %>%
mutate(my_result = regexp_extract(my_string, "(?<=/)[^/]*$")) %>%
count(my_result)
java.lang.IllegalArgumentException: Regex group count is 0, but the specified group index is 1
show_query()
shows the following query:
<SQL>
SELECT `my_result`, COUNT(*) AS `n`
FROM (SELECT `my_string`, regexp_extract(`my_string`, "(?<=/)[^/]*$") AS `my_result`
FROM `test_regexp_extract`) `q01`
GROUP BY `my_result`
I suspect it might be the problem of the forward slash character in the regex. I tried to escape it with a backslash.
Another idea is that for some reason it fails when no match is found. According to the Spark documentation it should work though:
If the regex did not match, or the specified group did not match, an empty string is returned.
Any idea why it fails? In R, the same code works fine.
regexp_extract
by default returns the first group. Your regex pattern (?<=/)[^/]*$
has no groups, so this is where the problem lies. You could:
()
, e.g. (?<=/)([^/]*)$
0
as the 3rd parameter to the function: regexp_extract(my_string, "(?<=/)[^/]*$", 0)
. This way, the full regex match will be returned as opposed to only the first group.