Here's a small example:
reg = ur"((?P<initial>[+\-👍])(?P<rest>.+?))$"
(In both cases the file has a -*- coding: utf-8 -*- declaration.)
In Python 2:
re.match(reg, u"👍hello").groupdict()
# => {u'initial': u'\ud83d', u'rest': u'\udc4dhello'}
# unicode why must you do this
Whereas, in Python 3:
re.match(reg, "👍hello").groupdict()
# => {'initial': '👍', 'rest': 'hello'}
The Python 3 behaviour is exactly what I want, but switching to Python 3 is currently not an option. What's the best way to replicate Python 3's results in Python 2, in a way that works in both narrow and wide Python builds? The 👍 appears to be reaching me in the format "\ud83d\udc4d", which is what's making this tricky.
In a Python 2 narrow build, non-BMP characters are two surrogate code points, so you can't use them in the [] syntax correctly. u'[👍]' is equivalent to u'[\ud83d\udc4d]', which means "match one of \ud83d or \udc4d". Python 2.7 example:
>>> u'\U0001f44d' == u'\ud83d\udc4d' == u'👍'
True
>>> re.findall(u'[👍]',u'👍')
[u'\ud83d', u'\udc4d']
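Whether an interpreter is a narrow build can be checked at runtime. The following is a minimal sketch, not part of the original answer; the names NARROW_BUILD and thumbs_up are chosen here for illustration. It relies on sys.maxunicode being 0xFFFF on narrow builds:

```python
import sys

# sys.maxunicode is 0xFFFF on a narrow build and 0x10FFFF otherwise.
NARROW_BUILD = sys.maxunicode == 0xFFFF

thumbs_up = u"\U0001F44D"

# On a narrow build the character is stored as a surrogate pair,
# so its length is 2; on a wide build or Python 3.3+ it is 1.
print(NARROW_BUILD, len(thumbs_up))
```

On a narrow build, any pattern that puts this character inside [] is really listing its two halves separately, which is the behaviour shown above.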
To fix this in both Python 2 and 3, match u'👍' OR [+-], using alternation (|) instead of putting the emoji inside the []. This returns the correct result in both Python 2 and 3:
#coding:utf8
from __future__ import print_function
import re
# Note the 'ur' syntax is an error in Python 3, so properly
# escape backslashes in the regex if needed. In this case,
# the backslash was unnecessary.
reg = u"((?P<initial>👍|[+-])(?P<rest>.+?))$"
tests = u'👍hello',u'-hello',u'+hello',u'\\hello'
for test in tests:
    m = re.match(reg,test)
    if m:
        print(test,m.groups())
    else:
        print(test,m)
Output (Python 2.7):
👍hello (u'\U0001f44dhello', u'\U0001f44d', u'hello')
-hello (u'-hello', u'-', u'hello')
+hello (u'+hello', u'+', u'hello')
\hello None
Output (Python 3.6):
👍hello ('👍hello', '👍', 'hello')
-hello ('-hello', '-', 'hello')
+hello ('+hello', '+', 'hello')
\hello None
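If the set of allowed leading characters grows, the same alternation idea can be generated instead of written by hand. This is only a sketch under that assumption: one_of is a hypothetical helper, not something from the re module, and it puts longer alternatives first so that a surrogate pair is tried as a whole before any single character.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import re


def one_of(chars):
    """Return a pattern matching exactly one of the given strings.

    Hypothetical helper: builds an alternation instead of a [] class,
    so a character stored as a surrogate pair is matched as one unit.
    """
    # Longest first, then alphabetical, for a deterministic pattern.
    ordered = sorted(set(chars), key=lambda c: (-len(c), c))
    return "(?:" + "|".join(re.escape(c) for c in ordered) + ")"


THUMBS_UP = "\U0001F44D"
reg = re.compile(
    "(?P<initial>" + one_of([THUMBS_UP, "+", "-"]) + ")(?P<rest>.+?)$"
)

m = reg.match(THUMBS_UP + "hello")
print(m.group("initial") == THUMBS_UP, m.group("rest") == "hello")
```

re.escape keeps characters such as + and - literal, so the caller does not need to escape them by hand.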