What exactly are the intended semantics for character ranges in regular expressions when one or both endpoints of the range lie outside the BMP? I've observed that the following input behaves differently in Python 2.7 and 3.5:
import re
bool(re.match(u"[\u1000-\U00021111]", "\u1234"))
In my 2.7 I get False, in 3.5 I get True. The latter makes sense to me. The former is perhaps due to \U00021111 being represented by a surrogate pair \ud844\udd11, but even then I don't understand it, since \u1000-\ud844 should include \u1234 just fine.
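For reference, here is a small sketch (runnable on Python 3) that confirms the surrogate-pair detail above by encoding the astral endpoint as UTF-16 code units, which is how a narrow 2.x build stored it internally:

```python
import struct

# Encode U+21111 as big-endian UTF-16: two 16-bit code units.
units = "\U00021111".encode("utf-16-be")
high, low = struct.unpack(">HH", units)

# The pair is exactly the \ud844\udd11 mentioned above.
assert (high, low) == (0xD844, 0xDD11)
```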
Just use the u prefix on the input string as well, to tell Python it is a Unicode string:
>>> bool(re.match(u"[\u1000-\U00021111]", u"\u1234")) # <= See u"\u1234"
True
In Python 2.7, a plain "..." literal is a byte string, so "\u1234" is the six characters \, u, 1, 2, 3, 4 rather than the single code point U+1234; matching that against the character class fails on the very first byte, which is why you saw False. Decode byte strings to Unicode (or use u'...' literals) before matching. In Python 3, all strings are Unicode by default, as the docs state, which is why the same code returns True there.
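To make the decoding point concrete, here is a sketch that emulates the 2.7 situation on Python 3: the bytes object stands in for what Python 2 stored in the unprefixed "\u1234" literal, and decoding it yields the single code point the range does match.

```python
import re

# What Python 2 saw in "\u1234": six raw characters, no U+1234 anywhere.
raw = b"\\u1234"

# Decoding turns the escape sequence into the single code point U+1234.
text = raw.decode("unicode_escape")
assert text == "\u1234"

# Now the character class matches, as it does natively on Python 3.
assert re.match(u"[\u1000-\U00021111]", text) is not None
```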