sql regex oracle-database databricks-sql

How to convert oracle regex pattern into Databricks compatible version?

Given this Oracle version with the goal to convert it so it is usable in Databricks SQL:

WITH pattern AS (
  SELECT
    /*+ INLINE */
    '^([A-Z][A-Z0-9])([A-Z]\d{3,4})([T])?([A-D])?([W]\d)?([A-RT-Z][A-Z0-9])?([V])?([Y])?([S][1-4])?$' AS pat_default,
    '\1\2\3\4\7' rep_plain_lot
)
SELECT 
t1.lot_number,
 CASE
     WHEN REGEXP_LIKE (t1.lot_number,(SELECT pat_default FROM rcod))
     THEN
         REGEXP_REPLACE (t1.lot_number,(SELECT pat_default FROM rcod),(SELECT rep_plain_lot FROM rcod))
     ELSE
      'no match'
    END plain_lot_new
FROM table t1

Expected result, for e.g.:

LOT_NUMBER | PLAIN_LOT
-----------+----------
A1X482     | A1X482
A1X482A    | A1X482

Whatever I try it results in "no match". I know that I need to double escape backslashes which I did but this didn't change anything. What I could find out is that the issue seems to be in the second pattern:

([A-Z]\d{3,4})

I would assume mainly the curly braces. My research led me to:

([[:upper:]][[:digit:]]{3,4})

But again, it always results in 'no match'. I'm stuck.

Solution

From the Databricks regex_replace documentation

The regexp string must be a Java regular expression.

When using literals, use raw-literal (r prefix) to avoid escape character pre-processing.

So you should be able to use something like (merging the first 4 capturing groups into one and simplifying some other parts):

SELECT t1.lot_number,
       CASE
       WHEN REGEXP_LIKE (
              t1.lot_number,
              r'^([A-Z][A-Z0-9][A-Z]\d{3,4}T?[A-D]?)(W\d)?([A-RT-Z][A-Z0-9])?(V)?Y?(S[1-4])?$'
            )
       THEN REGEXP_REPLACE (
              t1.lot_number,
              r'^([A-Z][A-Z0-9][A-Z]\d{3,4}T?[A-D]?)(W\d)?([A-RT-Z][A-Z0-9])?(V)?Y?(S[1-4])?$',
              r'\1\4'
            )
       ELSE 'no match'
       END AS plain_lot_new
FROM   table_name t1

My research led me to:
([[:upper:]][[:digit:]]{3,4})

Java does not support POSIX character classes and since the documentation states that "The regexp string must be a Java regular expression" this is not going to work (although it will work in Oracle as Oracle does support POSIX character classes).

Assuming ASCII input, the equivalent of the Oracle [A-Z]\d{3,4} regular expression as a Java regular expression would be one of:

[A-Z]\d{3,4} (it is valid without any changes)
\p{Upper}\p{Digit}{3,4}
\p{Upper}\p{Digit}\p{Digit}\p{Digit}\p{Digit}?
[A-Z]\d\d\d\d?
[A-Z][0-9]{3,4}
[A-Z][0-9][0-9][0-9][0-9]?

(There are probably other variations.)