
Treat an emoji as one character in a regex


Here's a small example:

reg = ur"((?P<initial>[+\-👍])(?P<rest>.+?))$"

(In both cases the file has -*- coding: utf-8 -*-)

In Python 2:

re.match(reg, u"👍hello").groupdict()
# => {u'initial': u'\ud83d', u'rest': u'\udc4dhello'}
# unicode why must you do this

Whereas, in Python 3:

re.match(reg, "👍hello").groupdict()
# => {'initial': '👍', 'rest': 'hello'}

The above behaviour is 100% perfect, but switching to Python 3 is currently not an option. What's the best way to replicate 3's results in 2, that works in both narrow and wide Python builds? The 👍 appears to be coming to me in the format "\ud83d\udc4d", which is what's making this tricky.


Solution

  • In a Python 2 narrow build, a non-BMP character is stored as two surrogate code units, so it can't be used correctly inside the [] syntax. u'[👍]' is equivalent to u'[\ud83d\udc4d]', which means "match one of \ud83d or \udc4d". Python 2.7 example:

    >>> u'\U0001f44d' == u'\ud83d\udc4d' == u'👍'
    True
    >>> re.findall(u'[👍]',u'👍')
    [u'\ud83d', u'\udc4d']
    

    To fix this in both Python 2 and 3, match u'👍' OR [+-] with an alternation instead of putting the emoji inside the character class. This returns the correct result in both Python 2 and 3:

    #coding:utf8
    from __future__ import print_function
    import re
    
    # Note the 'ur' syntax is an error in Python 3, so properly
    # escape backslashes in the regex if needed.  In this case,
    # the backslash was unnecessary.
    reg = u"((?P<initial>👍|[+-])(?P<rest>.+?))$"
    
    tests = u'👍hello',u'-hello',u'+hello',u'\\hello'
    for test in tests:
        m = re.match(reg,test)
        if m:
            print(test,m.groups())
        else:
            print(test,m)
    

    Output (Python 2.7):

    👍hello (u'\U0001f44dhello', u'\U0001f44d', u'hello')
    -hello (u'-hello', u'-', u'hello')
    +hello (u'+hello', u'+', u'hello')
    \hello None
    

    Output (Python 3.6):

    👍hello ('👍hello', '👍', 'hello')
    -hello ('-hello', '-', 'hello')
    +hello ('+hello', '+', 'hello')
    \hello None
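
If more than one emoji has to be accepted, the same alternation idea can probably be generated rather than typed by hand. The following is only a minimal sketch under some assumptions: the helper names `surrogate_pair`, `initial_pattern` and `split_initial` are hypothetical (they are not part of `re`), and the symbol list is just the one from the question. `surrogate_pair` shows where `\ud83d\udc4d` comes from; `initial_pattern` puts anything that is longer than one code unit on the running interpreter into an alternation and everything else into a character class, so it should behave the same on narrow and wide builds.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re


def surrogate_pair(code_point):
    """Return the UTF-16 (high, low) surrogates for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def initial_pattern(symbols):
    """Build a regex fragment that matches exactly one of `symbols`.

    Hypothetical helper: symbols whose len() is 1 on this interpreter go into
    a character class; longer ones (for example a surrogate pair on a narrow
    build) become alternation branches, longest first.
    """
    singles = [s for s in symbols if len(s) == 1]
    multis = sorted((s for s in symbols if len(s) > 1), key=len, reverse=True)
    parts = [re.escape(s) for s in multis]
    if singles:
        parts.append("[" + "".join(re.escape(s) for s in singles) + "]")
    if not parts:
        raise ValueError("need at least one non-empty symbol")
    return "|".join(parts)


# Written as an escape so the file does not depend on its source encoding.
THUMBS_UP = "\U0001F44D"
REG = re.compile(
    "(?P<initial>" + initial_pattern([THUMBS_UP, "+", "-"]) + ")(?P<rest>.+?)$"
)


def split_initial(text):
    """Return (initial, rest) if `text` starts with an accepted symbol, else None."""
    match = REG.match(text)
    if match is None:
        return None
    return match.group("initial"), match.group("rest")


# The two code units a narrow build stores for U+1F44D.
print([hex(unit) for unit in surrogate_pair(0x1F44D)])  # ['0xd83d', '0xdc4d']
```

With this sketch, `split_initial(THUMBS_UP + "hello")` should give the emoji and `"hello"` as separate items, while `split_initial("\\hello")` returns `None`, mirroring the test output above. Sorting the multi-unit symbols longest first is a design choice so that a longer sequence is tried before any shorter symbol that happens to be its prefix.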