[SOLVED] Regex matching with ASCII and extended Unicode characters

Regex matching with ASCII and extended Unicode characters

I'm using the following regex in Oracle DB under CLOB datatype column:

SELECT
CASE
    WHEN COUNT(*) > 0 THEN 1
    ELSE 0
END AS contains_extended_unicode 
FROM my_table 
WHERE REGEXP_LIKE(TO_CHAR(my_column), '[^[:print:]\r\n\t]');

The query should return as 0 when data is purely with printable ASCII characters from keyboard. If any characters beyond, i.e •, the bullet list from MS Word will be treated as extended Unicode character, and query return as 1.

Currently, if the data contains ASCII characters with more than 1 line, Oracle still treat as extended Unicode character, thus query return as 1.

Sample data:

qwerty
qwerty

Solution

Oracle does not support PERL-like shorthand character classes in regular expression character classes.

In a regular expression character class, Oracle does not evaluate \n\r\t as newline, carriage return and tab but as those literal characters. So [^[:print:]\r\n\t] will match all characters that are not either visible characters (the POSIX character class [:print:]) or \ or r or \ or n or \ or t.

What you want to do is use the CHR function to generate the characters and concatenate them into your regular expression:

SELECT CASE
       WHEN COUNT(*) > 0 THEN 1
       ELSE 0
       END AS contains_extended_unicode 
FROM   my_table 
WHERE  REGEXP_LIKE(
         my_column,
         '[^[:print:]' || CHR(10) || CHR(13) || CHR(9) || ']'
       );

If you want to match any character that is not an ASCII printable character (ASCII Codes 9, 10, 13 and 32-127) then you can use:

SELECT CASE
       WHEN COUNT(*) > 0 THEN 1
       ELSE 0
       END AS contains_extended_unicode 
FROM   my_table 
WHERE  REGEXP_LIKE(
         my_column,
         '[^' || CHR(32) || '-' || CHR(127) || CHR(10) || CHR(13) || CHR(9) || ']'
       );

fiddle