regexoraclelowercasenls-sort

Oracle's puzzling behaviour with NLS_SORT and a simple REGEXP_LIKE()


I ran into a weird behavior from Oracle this morning... And I can't understand why it acts this way from the docs. I'm sorry for the long post, but I want to make sure I'm understood. Oh, and make sure to read the note at the end before answering. :)

The goal of the request is to return rows with 1 or more lowercase characters. For the sake of the example, my table will be:

CREATE TABLE "TEMP_TABLE" 
   ( "VAL" VARCHAR2(4000 BYTE) );
Insert into TEMP_TABLE (VAL) values ('00A00');
Insert into TEMP_TABLE (VAL) values ('00000');
Insert into TEMP_TABLE (VAL) values ('BC000');
Insert into TEMP_TABLE (VAL) values ('ABC00');
Insert into TEMP_TABLE (VAL) values ('AAAAA');
Insert into TEMP_TABLE (VAL) values ('abc00');

Using this SQL query:

select val, 
  case when regexp_like (val, '[a-b]')
       then 'MATCH' 
       else 'NO' 
  end 
  from temp_table;

If the NLS_SORT value of the session is set to BINARY, Oracle returns:

00A00   NO
00000   NO
BC000   NO
ABC00   NO
AAAAA   NO
abc00   MATCH

All good here: the only word containing a lowercase letter matches - the others don't.

But if NLS_SORT is set to FRENCH, the results are less understandable:

00A00   NO
00000   NO
BC000   MATCH
ABC00   MATCH
AAAAA   NO
abc00   MATCH

From what I can deduce, the regexp matches when there are characters other than A.

So my question is: Why would Oracle understand [a-z] as "rows with letters that are not A"?

Notes:

  1. The database is Oracle 10G(r2), and the session's NLS parameters are these:

    NLS_CALENDAR             GREGORIAN
    NLS_COMP                 BINARY
    NLS_CURRENCY             ¿
    NLS_DATE_FORMAT          DD/MM/RR HH24:MI
    NLS_DATE_LANGUAGE        FRENCH
    NLS_DUAL_CURRENCY        ¿
    NLS_ISO_CURRENCY         FRANCE
    NLS_LANGUAGE             FRENCH
    NLS_LENGTH_SEMANTICS     BYTE
    NLS_NCHAR_CONV_EXCP      FALSE
    NLS_NUMERIC_CHARACTERS   , 
    NLS_SORT                 FRENCH_M
    NLS_TERRITORY            FRANCE
    NLS_TIME_FORMAT          HH24:MI:SSXFF
    NLS_TIMESTAMP_FORMAT     DD/MM/RR HH24:MI:SSXFF
    NLS_TIMESTAMP_TZ_FORMAT  DD/MM/RR HH24:MI:SSXFF TZR
    NLS_TIME_TZ_FORMAT       HH24:MI:SSXFF TZR
    
  2. Yes, I could use regexp_like(val, '[[:lower:]]'). But I found out about this later on, and it doesn't explain the weird behaviour.


Solution

  • For better or worse, the sort ordering defined by nls_sort is being used to evaluate the [a-z] regexp. If you insert a,b,c,A,B,C into temp_table and sort it under each setting you'll get the following:

    SQL> alter session set nls_sort=BINARY;
    
    Session altered.
    
    SQL> select val,
      2  case when regexp_like (val, '[a-z]') then 'MATCH' else 'NO' end m
      3  from temp_table order by val;
    
    VAL           M
    ------------------------- -------------------------
    A             NO
    B             NO
    C             NO
    a             MATCH
    b             MATCH
    c             MATCH
    
    6 rows selected.
    
    SQL> alter session set nls_sort=FRENCH;
    
    Session altered.
    
    SQL> select val,
      2  case when regexp_like (val, '[a-z]') then 'MATCH' else 'NO' end m
      3  from temp_table order by val;
    
    VAL           M
    ------------------------- -------------------------
    A             NO
    a             MATCH
    B             MATCH
    b             MATCH
    C             MATCH
    c             MATCH
    
    6 rows selected.
    

    Since the uppercase letters are "interleaved" with the lowercase letters in the French setting it evaluates to true in Oracle's implementation.