I executed the same statement in MySQL and Oracle, but they returned different results. Regular expression "/?" means to return zero or more "/", but why would MySQL return me a "products", is the MySQL implementation wrong?
SELECT
REGEXP_SUBSTR('http://www.example.com/products',
'http://([[:alnum:]]+\.?){3,4}/?') "REGEXP_SUBSTR"
FROM DUAL;
Oracle Output:
REGEXP_SUBSTR
http://www.example.com/
MySQL Output:
REGEXP_SUBSTR
http://www.example.com/products
I queried the documentation for Oracle and MySQL, and they were interested in "? The definition of " is to match zero or one character.MySQL says they implemented regular support with ICU libraries.But the output of ICU is "http://www.example.com/".
int main(int argc, char *argv[]) {
uint32_t flags = 0;
flags |= UREGEX_UNIX_LINES;
flags |= UREGEX_CASE_INSENSITIVE;
const char *c_source_char = "http://www.example.com/products";
const char *c_pat_char = "http://([[:alnum:]]+\\.?){3}/?";;
UErrorCode status = U_ZERO_ERROR;
URegularExpression *regexp = uregex_openC(c_pat_char, 0, NULL, &status);
UChar ustr[100];
u_uastrcpy(ustr, c_source_char);
uregex_setText(regexp, ustr, -1, &status);
if (uregex_find(regexp, 0, &status)) {
int32_t start = uregex_start(regexp, 0, &status);
int32_t end = uregex_end(regexp, 0, &status);
if(end - start > 0) {
std::string res(c_source_char+start, c_source_char+end);
std::cout << res << std::endl;
}
printf("Found match at %d-%d\n", start, end);
}
uregex_close(regexp);
return 0;
}
The problem is not with /?
, it is with the preceding \.?
.
In MySQL, the regular expression syntax documentation states:
To use a literal instance of a special character in a regular expression, precede it by two backslash (
\
) characters. The MySQL parser interprets one of the backslashes, and the regular expression library interprets the other. For example, to match the string 1+2 that contains the special + character, only the last of the following regular expressions is the correct one:mysql> SELECT REGEXP_LIKE('1+2', '1+2'); -> 0 mysql> SELECT REGEXP_LIKE('1+2', '1\+2'); -> 0 mysql> SELECT REGEXP_LIKE('1+2', '1\\+2'); -> 1
So:
SELECT REGEXP_SUBSTR(
'http://www.example.com/products',
'http://([[:alnum:]]+\.?){3,4}/?'
) AS single_slash,
REGEXP_SUBSTR(
'http://www.example.com/products',
'http://([[:alnum:]]+\\.?){3,4}/?'
) AS double_slash,
REGEXP_SUBSTR(
'http://www.example.com/products',
'http://([[:alnum:]]+[.]?){3,4}/?'
) AS character_class
FROM DUAL;
Outputs:
single_slash | double_slash | character_class |
---|---|---|
http://www.example.com/products | http://www.example.com/ | http://www.example.com/ |
In the first pattern \.
is the same as .
and is matching any character so [[:alnum:]]+\.?
matches www.
then example.
then com/
then products
for the 4th occurrence. To match only a .
character, you want either \\.
or [.]
.
In Oracle, you want to use \.
or [.]
and not \\.
to match the .
character. (So if you want to use the same expression in both SQL dialects then use http://([[:alnum:]]+[.]?){3,4}/?
.)