pythonregexunicodesurrogate-pairsastral-plane

Python semantics for unicode ranges involving astral planes


What exactly are the intended semantics for character ranges in regular expressions if one or both endpoints of the range are outside the BMP? I've observed that the following input behaves different in Python 2.7 and 3.5:

import re
bool(re.match(u"[\u1000-\U00021111]", "\u1234"))

In my 2.7 I get False, in 3.5 I get True. The latter makes sense to me. The former is perhaps due to \U00021111 being represented by a surrogate pair \ud844\udd11, but even then I don't understand it since \u1000-\ud844 should include \u1234 just fine.


Solution

  • Just use the u prefix with the input string to tell Python it is a Unicode string:

    >>> bool(re.match(u"[\u1000-\U00021111]", u"\u1234")) # <= See u"\u1234"
    True
    

    In Python 2.7, you need to decode the strings to Unicode each time you process them. In Python 3, all strings are Unicode by default, and it is stated in the docs.