oraclelexeroracle11gr2oracle-text

Oracle Text Contains and technical content


I'am searching for the technical word "AN-XYZ99". So I use

SELECT *
FROM foo
WHERE CONTAINS(bar, 'AN{-}XYZ99') > 0

but I get also results like "FO-XYZ99" or "BAR-XYZ99". What can I do to ensure the expected result?

I used

BEGIN
    CTX_DDL.CREATE_PREFERENCE('FOO','BASIC_LEXER');
    CTX_DDL.SET_ATTRIBUTE('FOO', 'ALTERNATE_SPELLING', 'GERMAN');
    CTX_DDL.SET_ATTRIBUTE('FOO', 'COMPOSITE', 'GERMAN');
    CTX_DDL.SET_ATTRIBUTE('FOO', 'MIXED_CASE', 'NO');
END;

Sample data from column "bar" (VARCHAR2(4000)):

"unbekannt Stadt Text: AN-XYZ99 << foobar Straße 31.12.2017 Datum Host 20160101 foo"
"unbekannt Stadt Text: FO-XYZ99 << foobar Straße 31.12.2017 Datum Host 20160101 bar"
"unbekannt Stadt Text: BAR-XYZ99 << foobar Straße 31.12.2017 Datum Host 20160101 bla"

With the Statement above I would like the first row as output but I get the second and third row as well.

Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production


Solution

  • First you must define hyphen as a printjoin in your lexer.

    check it with

    select IXV_ATTRIBUTE, IXV_VALUE from CTXSYS.CTX_INDEX_VALUES where IXV_CLASS =  'LEXER';
    
    IXV_ATTRIBUTE                  IXV_VALUE     
    -----------------------------------------
    PRINTJOINS                     _$%&-         
    NUMJOIN                        .              
    NUMGROUP                       .              
    WHITESPACE                     ,= 
    

    Then you may (after re-creating index with this lexer) validate that the tokens are as expected: (your table would vary based on the index name; check all tables like 'DR$%$I')

    select TOKEN_TEXT from DR$TEXTIDX_IDX$I where TOKEN_TEXT like '%-XYZ99';
    TOKEN_TEXT                                                     
    ----------------------------------------------------------------
    AN-XYZ99                                                         
    BAR-XYZ99                                                        
    FO-XYZ99
    

    Now you may query for the search string.

    Aparently you must escape the hyphen as BAR-XYZ99 will find rows with BAR not containing XYZ99 ; although the documentation of hyphen with no space is a bit different.

    SELECT SCORE(1),txt
    FROM textidx
    WHERE  CONTAINS(txt, 'BAR-XYZ99',1) > 0; 
    
      SCORE(1) TXT                                                                                
    ---------- ------------------------------------------------------------------------------------
             4 unbekannt Stadt Text: FO-XYZ99 << foobar Straße 31.12.2017 Datum Host 20160101 bar
    

    For some reason (I'm on 11.2.0.2.0) the escaping with curly braces doesn't work (returns no match), but using backslash is fine.

    SELECT SCORE(1),txt
    FROM textidx
    WHERE  CONTAINS(txt, 'BAR\-XYZ99',1) > 0;  
    
      SCORE(1) TXT                                                                                
    ---------- ------------------------------------------------------------------------------------
             4 unbekannt Stadt Text: BAR-XYZ99 << foobar Straße 31.12.2017 Datum Host 20160101 bla