python regex python-2.7 chinese-locale

Special characters with Chinese characters not substituted correctly in Python string


I cannot seem to substitute a ')' or a '(' without causing errors in other strings. Both ')' and '(' are special characters in regular expressions. Here are two strings: "sample(志信达).mbox" and "sample#宋安兴.mbox". If I use re to substitute the parentheses, one of the Chinese characters in the second string is altered too, even though that string contains no parentheses. Here is the Python code:

# -*- coding: utf-8 -*-
import re
source1='sample(志信达).mbox'
source2='sample#宋安兴.mbox'
newname1=re.sub(r'[\(\);)(]','-',source1)
newname2=re.sub(r'[\(\);)(]','-',source2)
print source1,newname1
print source2,newname2

Here is the result:

sample(志信达).mbox sample---志信达---.mbox
sample#宋安兴.mbox sample#宋?-兴.mbox

Notice that in the second string one of the Chinese characters (安) is replaced with '?-'.


Solution

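  • In Python 2 a plain '...' literal is a byte string, so re matches individual UTF-8 bytes and can replace part of a multi-byte Chinese character. You should use unicode literals for both the strings and the pattern (see https://docs.python.org/2/howto/unicode.html#unicode-literals-in-python-source-code):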

    # -*- coding: utf-8 -*-
    import re
    source1 = u'sample(志信达).mbox'
    source2 = u'sample#宋安兴.mbox'
    newname1 = re.sub(ur'[\(\);)(]','-',source1)
    newname2 = re.sub(ur'[\(\);)(]','-',source2)
    print source1,newname1
    print source2,newname2
    

    Result:

    sample(志信达).mbox sample-志信达-.mbox
    sample#宋安兴.mbox sample#宋安兴.mbox
    

    Also, do not forget to save your .py file as UTF-8 so that it matches the coding declaration on the first line (your IDE may do this automatically, or you may have to change the encoding manually, depending on the text editor you use).
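
To see why the byte-string version corrupts 安, it helps to look at the UTF-8 bytes involved. The sketch below is an illustration, not part of the original answer. It assumes the pattern contained a full-width closing parenthesis (U+FF09), which the three dashes per parenthesis in the question's output suggest. All variable names are made up for the example, and the code is written to run under both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re

# Assumption: the original pattern held a full-width ')' (U+FF09).
FULLWIDTH_CLOSE = u'\uff09'
AN = u'\u5b89'  # the character 安 from the second file name

close_bytes = FULLWIDTH_CLOSE.encode('utf-8')  # EF BC 89
an_bytes = AN.encode('utf-8')                  # E5 AE 89

# Byte-string pattern: the character class holds three separate bytes,
# so the trailing 0x89 of 安 matches and is replaced on its own.
byte_pattern = re.compile(b'[' + close_bytes + b']')
broken = byte_pattern.sub(b'-', an_bytes)

# Unicode pattern: the class holds one code point, so 安 is left alone.
text_pattern = re.compile(u'[' + FULLWIDTH_CLOSE + u']')
intact = text_pattern.sub(u'-', AN)

print(repr(close_bytes), repr(an_bytes))
print(repr(broken))   # two leftover bytes followed by '-'
print(intact == AN)   # True
```

Under this assumption, the '?' in the question's output would be the terminal's rendering of the two leftover bytes, which no longer form a valid UTF-8 character.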