mysqloracle-databaseicuregexp-substr

regexp_substr in Oracle and Mysql


I executed the same statement in MySQL and Oracle, but they returned different results. Regular expression "/?" means to return zero or more "/", but why would MySQL return me a "products", is the MySQL implementation wrong?

SELECT
  REGEXP_SUBSTR('http://www.example.com/products',
                'http://([[:alnum:]]+\.?){3,4}/?') "REGEXP_SUBSTR"
  FROM DUAL;

Oracle Output:
REGEXP_SUBSTR
http://www.example.com/

MySQL Output:
REGEXP_SUBSTR
http://www.example.com/products

I queried the documentation for Oracle and MySQL, and they were interested in "? The definition of " is to match zero or one character.MySQL says they implemented regular support with ICU libraries.But the output of ICU is "http://www.example.com/".

int main(int argc, char *argv[]) {
    uint32_t flags = 0;
    flags |= UREGEX_UNIX_LINES;
    flags |= UREGEX_CASE_INSENSITIVE;
    const char *c_source_char = "http://www.example.com/products";
    const char *c_pat_char = "http://([[:alnum:]]+\\.?){3}/?";;
    UErrorCode status = U_ZERO_ERROR;
    URegularExpression *regexp = uregex_openC(c_pat_char, 0, NULL, &status);
    UChar ustr[100];
    u_uastrcpy(ustr, c_source_char);
    uregex_setText(regexp, ustr, -1, &status);
    if (uregex_find(regexp, 0, &status)) {
        int32_t start = uregex_start(regexp, 0, &status);
        int32_t end = uregex_end(regexp, 0, &status);
        if(end - start > 0) {
            std::string res(c_source_char+start, c_source_char+end);
            std::cout << res << std::endl;
        }
        
        printf("Found match at %d-%d\n", start, end);
    }
    uregex_close(regexp);
    return 0;
}

Solution

  • The problem is not with /?, it is with the preceding \.?.

    In MySQL, the regular expression syntax documentation states:

    To use a literal instance of a special character in a regular expression, precede it by two backslash (\) characters. The MySQL parser interprets one of the backslashes, and the regular expression library interprets the other. For example, to match the string 1+2 that contains the special + character, only the last of the following regular expressions is the correct one:

    mysql> SELECT REGEXP_LIKE('1+2', '1+2');                       -> 0
    mysql> SELECT REGEXP_LIKE('1+2', '1\+2');                      -> 0
    mysql> SELECT REGEXP_LIKE('1+2', '1\\+2');                     -> 1
    

    So:

    SELECT REGEXP_SUBSTR(
             'http://www.example.com/products',
             'http://([[:alnum:]]+\.?){3,4}/?'
           ) AS single_slash,
           REGEXP_SUBSTR(
             'http://www.example.com/products',
             'http://([[:alnum:]]+\\.?){3,4}/?'
           ) AS double_slash,
           REGEXP_SUBSTR(
             'http://www.example.com/products',
             'http://([[:alnum:]]+[.]?){3,4}/?'
           ) AS character_class
    FROM   DUAL;
    

    Outputs:

    single_slash double_slash character_class
    http://www.example.com/products http://www.example.com/ http://www.example.com/

    In the first pattern \. is the same as . and is matching any character so [[:alnum:]]+\.? matches www. then example. then com/ then products for the 4th occurrence. To match only a . character, you want either \\. or [.].

    MySQL fiddle

    In Oracle, you want to use \. or [.] and not \\. to match the . character. (So if you want to use the same expression in both SQL dialects then use http://([[:alnum:]]+[.]?){3,4}/?.)

    Oracle fiddle