postgresqlencodebytea

like / ilike with encode(bytea) not work with unicode


I have a system where I need to compare bytea values ​​(case insensitive in this case). For this purpose, I use encode to convert the bytea values ​​into strings which I then compare. This works as long as there is no unicode character in the array.

select 
    encode(cid, 'hex'),     
    cid = '\x3030386132323035'::bytea a,
    encode(cid, 'escape') = encode('\x3030386132323035'::bytea, 'escape') b,
    encode(cid, 'escape') like encode('\x3030386132323035'::bytea, 'escape') c,
    encode(cid, 'escape') ilike encode('\x3030386132323035'::bytea, 'escape') d
from history
encode a b c d
3030386132323035 true true true true

When there are unicode characters in bytea, however, like/ilike do not work

select 
    encode(cid, 'hex'),     
    cid = '\x303030d0a2d095d0a1d0a231'::bytea a,
    encode(cid, 'escape') = encode('\x303030d0a2d095d0a1d0a231'::bytea, 'escape') b,
    encode(cid, 'escape') like encode('\x303030d0a2d095d0a1d0a231'::bytea, 'escape') c,
    encode(cid, 'escape') ilike encode('\x303030d0a2d095d0a1d0a231'::bytea, 'escape') d
from history
encode a b c d
303030d0a2d095d0a1d0a231 true true false false

Can anyone explain what is causing this behavior?


Solution

  • That's because of the special meaning of backslashes in LIKE, ILIKE and SIMILAR TO syntax:

    The default escape character is the backslash but a different one can be selected by using the ESCAPE clause. To match the escape character itself, write two escape characters.

    Which doesn't play well with escape format:

    escape
    The escape format converts zero bytes and bytes with the high bit set into octal escape sequences (\nnn), and it doubles backslashes. Other byte values are represented literally.

    On a simpler example:

    select '\a' like '\a';--yields False
    

    You can fix that by disabling or changing the default escape character: demo at db<>fiddle

    select    cid            as as_default
     , encode(cid, 'escape') as as_escape
     , cid = '\x303030d0a2d095d0a1d0a231'::bytea as a
     , encode(cid,'escape')
       = encode('\x303030d0a2d095d0a1d0a231'::bytea,'escape') 
       as b
     , encode(cid,'escape')
       like  encode('\x303030d0a2d095d0a1d0a231'::bytea,'escape') 
       as c
     , encode(cid,'escape')
       ilike encode('\x303030d0a2d095d0a1d0a231'::bytea,'escape') 
       as d
     , encode(cid,'escape')
       like  encode('\x303030d0a2d095d0a1d0a231'::bytea,'escape') 
       escape '' --escaping disabled
       as e
     , encode(cid,'escape')
       ilike encode('\x303030d0a2d095d0a1d0a231'::bytea,'escape') 
       escape '!' --escape character changed from default `\` to `!`
       as f
    from history
    where cid='\x303030d0a2d095d0a1d0a231'::bytea;
    
    as_default as_escape a b c d e f
    \x303030d0a2d095d0a1d0a231 000\320\242\320\225\320\241\320\2421 T T f f T T

    Simpler test:

    select '\a' like '\a';           --False
    select '\a' like '\a' escape ''; --True
    select '\a' like '\a' escape '!';--True