sqlregexoracledatabricks-sql

How to convert oracle regex pattern into Databricks compatible version?


Given this Oracle version with the goal to convert it so it is usable in Databricks SQL:

WITH pattern AS (
  SELECT
    /*+ INLINE */
    '^([A-Z][A-Z0-9])([A-Z]\d{3,4})([T])?([A-D])?([W]\d)?([A-RT-Z][A-Z0-9])?([V])?([Y])?([S][1-4])?$' AS pat_default,
    '\1\2\3\4\7' rep_plain_lot
)
SELECT 
t1.lot_number,
 CASE
     WHEN REGEXP_LIKE (t1.lot_number,(SELECT pat_default FROM rcod))
     THEN
         REGEXP_REPLACE (t1.lot_number,(SELECT pat_default FROM rcod),(SELECT rep_plain_lot FROM rcod))
     ELSE
      'no match'
    END plain_lot_new
FROM table t1

Expected result, for e.g.:

LOT_NUMBER | PLAIN_LOT
-----------+----------
A1X482     | A1X482
A1X482A    | A1X482

Whatever I try it results in "no match". I know that I need to double escape backslashes which I did but this didn't change anything. What I could find out is that the issue seems to be in the second pattern:

([A-Z]\d{3,4})

I would assume mainly the curly braces. My research led me to:

([[:upper:]][[:digit:]]{3,4})

But again, it always results in 'no match'. I'm stuck.


Solution

  • From the Databricks regex_replace documentation

    The regexp string must be a Java regular expression.

    When using literals, use raw-literal (r prefix) to avoid escape character pre-processing.

    So you should be able to use something like (merging the first 4 capturing groups into one and simplifying some other parts):

    SELECT t1.lot_number,
           CASE
           WHEN REGEXP_LIKE (
                  t1.lot_number,
                  r'^([A-Z][A-Z0-9][A-Z]\d{3,4}T?[A-D]?)(W\d)?([A-RT-Z][A-Z0-9])?(V)?Y?(S[1-4])?$'
                )
           THEN REGEXP_REPLACE (
                  t1.lot_number,
                  r'^([A-Z][A-Z0-9][A-Z]\d{3,4}T?[A-D]?)(W\d)?([A-RT-Z][A-Z0-9])?(V)?Y?(S[1-4])?$',
                  r'\1\4'
                )
           ELSE 'no match'
           END AS plain_lot_new
    FROM   table_name t1
    

    My research led me to:

    ([[:upper:]][[:digit:]]{3,4})
    

    Java does not support POSIX character classes and since the documentation states that "The regexp string must be a Java regular expression" this is not going to work (although it will work in Oracle as Oracle does support POSIX character classes).

    Assuming ASCII input, the equivalent of the Oracle [A-Z]\d{3,4} regular expression as a Java regular expression would be one of:

    (There are probably other variations.)