Given this Oracle version with the goal to convert it so it is usable in Databricks SQL:
WITH pattern AS (
SELECT
/*+ INLINE */
'^([A-Z][A-Z0-9])([A-Z]\d{3,4})([T])?([A-D])?([W]\d)?([A-RT-Z][A-Z0-9])?([V])?([Y])?([S][1-4])?$' AS pat_default,
'\1\2\3\4\7' rep_plain_lot
)
SELECT
t1.lot_number,
CASE
WHEN REGEXP_LIKE (t1.lot_number,(SELECT pat_default FROM rcod))
THEN
REGEXP_REPLACE (t1.lot_number,(SELECT pat_default FROM rcod),(SELECT rep_plain_lot FROM rcod))
ELSE
'no match'
END plain_lot_new
FROM table t1
Expected result, for e.g.:
LOT_NUMBER | PLAIN_LOT
-----------+----------
A1X482 | A1X482
A1X482A | A1X482
Whatever I try it results in "no match". I know that I need to double escape backslashes which I did but this didn't change anything. What I could find out is that the issue seems to be in the second pattern:
([A-Z]\d{3,4})
I would assume mainly the curly braces. My research led me to:
([[:upper:]][[:digit:]]{3,4})
But again, it always results in 'no match'. I'm stuck.
From the Databricks regex_replace
documentation
The regexp string must be a Java regular expression.
When using literals, use
raw-literal
(r
prefix) to avoid escape character pre-processing.
So you should be able to use something like (merging the first 4 capturing groups into one and simplifying some other parts):
SELECT t1.lot_number,
CASE
WHEN REGEXP_LIKE (
t1.lot_number,
r'^([A-Z][A-Z0-9][A-Z]\d{3,4}T?[A-D]?)(W\d)?([A-RT-Z][A-Z0-9])?(V)?Y?(S[1-4])?$'
)
THEN REGEXP_REPLACE (
t1.lot_number,
r'^([A-Z][A-Z0-9][A-Z]\d{3,4}T?[A-D]?)(W\d)?([A-RT-Z][A-Z0-9])?(V)?Y?(S[1-4])?$',
r'\1\4'
)
ELSE 'no match'
END AS plain_lot_new
FROM table_name t1
My research led me to:
([[:upper:]][[:digit:]]{3,4})
Java does not support POSIX character classes and since the documentation states that "The regexp string must be a Java regular expression" this is not going to work (although it will work in Oracle as Oracle does support POSIX character classes).
Assuming ASCII input, the equivalent of the Oracle [A-Z]\d{3,4}
regular expression as a Java regular expression would be one of:
[A-Z]\d{3,4}
(it is valid without any changes)\p{Upper}\p{Digit}{3,4}
\p{Upper}\p{Digit}\p{Digit}\p{Digit}\p{Digit}?
[A-Z]\d\d\d\d?
[A-Z][0-9]{3,4}
[A-Z][0-9][0-9][0-9][0-9]?
(There are probably other variations.)