Tags: python, string, python-2.7, unicode

Python Removing Non Latin Characters


How can I delete all non-Latin characters from a string? More specifically, is there a way to identify non-Latin characters in Unicode data?


Solution

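  • Using the third-party regex module, you could remove all non-Latin characters (including digits, whitespace, and punctuation, which belong to the Common script rather than Latin) with: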

    import regex
    result = regex.sub(ur'[^\p{Latin}]', u'', text)
    

    If you don't want to use the regex module, this page lists the Latin Unicode blocks:

    \p{InBasic_Latin}: U+0000–U+007F
    \p{InLatin-1_Supplement}: U+0080–U+00FF
    \p{InLatin_Extended-A}: U+0100–U+017F
    \p{InLatin_Extended-B}: U+0180–U+024F
    \p{InLatin_Extended_Additional}: U+1E00–U+1EFF 
    

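    So you could use these to form a character class using Python's built-in re module. Unlike \p{Latin}, these blocks also contain digits, punctuation, and whitespace, so this pattern keeps them: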

    import re
    result = re.sub(ur'[^\x00-\x7F\x80-\xFF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF]', u'', text) 
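
The `ur'...'` prefix is Python 2 only and is a syntax error in Python 3. As a rough sketch of the same block-based approach for Python 3 (the names `NON_LATIN_BLOCKS` and `strip_non_latin_blocks` are hypothetical, chosen for illustration), the pattern can be written as a plain raw string, since `re` itself understands `\uXXXX` escapes from Python 3.3 on:

```python
import re

# Python 3 sketch: every str is Unicode, so the Python 2 `ur` prefix is dropped.
# The ranges cover the five Latin blocks listed above; the first four are
# contiguous (U+0000 through U+024F), so they collapse into a single range.
NON_LATIN_BLOCKS = re.compile(
    r'[^\u0000-\u024F'   # Basic Latin through Latin Extended-B
    r'\u1E00-\u1EFF]'    # Latin Extended Additional
)

def strip_non_latin_blocks(text):
    """Remove every character that falls outside the Latin blocks above."""
    return NON_LATIN_BLOCKS.sub('', text)

sample = 'aweerwq\u0645\u0631\u062d\u0628\u0627\u043c\u0438\u0440'
print(strip_non_latin_blocks(sample))  # aweerwq
```

Compiling the pattern once keeps repeated calls cheap; whether that matters depends on how often the function is called.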
    

    Demo:

    In [24]: import re
    In [25]: import regex
    
    In [35]: text = u'aweerwq\u0645\u0631\u062d\u0628\u0627\u043c\u0438\u0440'
    
    In [36]: print(text)
    aweerwqمرحباмир
    
    In [37]: regex.sub(ur'[^\p{Latin}]', u'', text)
    Out[37]: u'aweerwq'
    
    In [38]: re.sub(ur'[^\x00-\x7F\x80-\xFF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF]', u'', text)    
    Out[38]: u'aweerwq'
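
For the second part of the question (finding non-Latin characters from Unicode data), one possible approach that avoids both regular expressions and third-party modules is to look at character names in the standard library's `unicodedata` module. This is only a heuristic sketch in Python 3, and it is not equivalent to the Unicode Script property used by `\p{Latin}`: it assumes that a name starting with `LATIN` is a good enough signal, which misses a few characters such as U+212A KELVIN SIGN. The helper names `is_latin_letter` and `keep_latin` are hypothetical.

```python
import unicodedata

def is_latin_letter(ch):
    """Heuristic: a character counts as Latin if its Unicode name starts with 'LATIN'."""
    # The '' default avoids a ValueError for code points that have no name.
    return unicodedata.name(ch, '').startswith('LATIN')

def keep_latin(text, keep_whitespace=True):
    """Drop characters that are neither Latin letters nor (optionally) whitespace."""
    return ''.join(
        ch for ch in text
        if is_latin_letter(ch) or (keep_whitespace and ch.isspace())
    )

sample = 'aweerwq\u0645\u0631\u062d\u0628\u0627\u043c\u0438\u0440'
print(keep_latin(sample))  # aweerwq
```

The `keep_whitespace` flag is there because stripping spaces along with foreign text is rarely what is wanted for multi-word input; set it to `False` to mimic the behaviour of the `\p{Latin}` pattern more closely.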