rregexapache-sparkillegalargumentexceptionsparklyr

Spark regexp_extract() fails - Regex group count is 0, but the specified group index is 1


I would like to extract the last part of the string (after the last forward slash). When I use the following code it fails with the error:

library(sparklyr)
library(tidyverse)

sc <- spark_connect(method = "databricks")

tibble(my_string = c("aaa/bbb/ccc", "ddd/eee/fff", "ggg/hhh/iii")) %>%
  copy_to(sc, ., "test_regexp_extract", overwrite = TRUE) %>%
  mutate(my_result = regexp_extract(my_string, "(?<=/)[^/]*$")) %>%
  count(my_result)

java.lang.IllegalArgumentException: Regex group count is 0, but the specified group index is 1

show_query() shows the following query:

<SQL>
SELECT `my_result`, COUNT(*) AS `n`
FROM (SELECT `my_string`, regexp_extract(`my_string`, "(?<=/)[^/]*$") AS `my_result`
FROM `test_regexp_extract`) `q01`
GROUP BY `my_result`

I suspect it might be the problem of the forward slash character in the regex. I tried to escape it with a backslash.
Another idea is that for some reason it fails when no match is found. According to the Spark documentation it should work though:

If the regex did not match, or the specified group did not match, an empty string is returned.

Any idea why it fails? In R, the same code works fine.


Solution

  • regexp_extract by default returns the first group. Your regex pattern (?<=/)[^/]*$ has no groups, so this is where the problem lies. You could: