[SOLVED] Spark regexp_extract() fails - Regex group count is 0, but the specified group index is 1

Spark regexp_extract() fails - Regex group count is 0, but the specified group index is 1

I would like to extract the last part of the string (after the last forward slash). When I use the following code it fails with the error:

library(sparklyr)
library(tidyverse)

sc <- spark_connect(method = "databricks")

tibble(my_string = c("aaa/bbb/ccc", "ddd/eee/fff", "ggg/hhh/iii")) %>%
  copy_to(sc, ., "test_regexp_extract", overwrite = TRUE) %>%
  mutate(my_result = regexp_extract(my_string, "(?<=/)[^/]*$")) %>%
  count(my_result)

java.lang.IllegalArgumentException: Regex group count is 0, but the specified group index is 1

show_query() shows the following query:

<SQL>
SELECT `my_result`, COUNT(*) AS `n`
FROM (SELECT `my_string`, regexp_extract(`my_string`, "(?<=/)[^/]*$") AS `my_result`
FROM `test_regexp_extract`) `q01`
GROUP BY `my_result`

I suspect it might be the problem of the forward slash character in the regex. I tried to escape it with a backslash.
Another idea is that for some reason it fails when no match is found. According to the Spark documentation it should work though:

If the regex did not match, or the specified group did not match, an empty string is returned.

Any idea why it fails? In R, the same code works fine.

Solution

regexp_extract by default returns the first group. Your regex pattern (?<=/)[^/]*$ has no groups, so this is where the problem lies. You could:

create a group using (), e.g. (?<=/)([^/]*)$
add 0 as the 3rd parameter to the function: regexp_extract(my_string, "(?<=/)[^/]*$", 0). This way, the full regex match will be returned as opposed to only the first group.