python regex non-english

Detect strings with non-English characters in Python


I have some strings that contain a mix of English and non-English letters. For example:

w='_1991_اف_جي2'

How can I detect strings like this using a regex or any other fast method in Python?

I would rather not compare each letter of the string against a list of letters; I want to do this in one shot, quickly.


Solution

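  • You can simply check whether the string can be represented using only ASCII characters (the basic Latin alphabet plus digits, punctuation, and control characters). If it cannot, it contains characters from some other alphabet. Note that this also treats accented Latin letters such as í or š as non-English.

    Note the comment # -*- coding: utf-8 -*- at the top of the Python file. Python 2 needs it whenever the source contains non-ASCII literals (otherwise you get an error about the encoding); Python 3 assumes UTF-8 source by default, so there it is optional.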

    # -*- coding: utf-8 -*-
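    def isEnglish(s):
        """Return True if s contains only ASCII characters."""
        try:
            # Non-ASCII characters become bytes >= 0x80 in UTF-8,
            # which the ASCII decoder rejects.
            s.encode(encoding='utf-8').decode('ascii')
        except UnicodeDecodeError:
            return False
        else:
            return True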
    
    assert not isEnglish('slabiky, ale liší se podle významu')
    assert isEnglish('English')
    assert not isEnglish('ގެ ފުރަތަމަ ދެ އަކުރު ކަ')
    assert not isEnglish('how about this one : 通 asfަ')
    assert isEnglish('?fd4))45s&')
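
Since the question also asks about regex, here is a minimal sketch of the same ASCII-only check done two other ways. It is an alternative to the encode/decode approach above, not part of the original answer. The names NON_ASCII, has_non_ascii and is_ascii are made up for illustration. str.isascii() only exists on Python 3.7 and later, so the sketch falls back to the regex when it is missing.

```python
import re

# Matches any single character outside the 7-bit ASCII range (U+0000..U+007F).
NON_ASCII = re.compile(r'[^\x00-\x7F]')


def has_non_ascii(s):
    """Return True if s contains at least one non-ASCII character."""
    return NON_ASCII.search(s) is not None


def is_ascii(s):
    """Return True if every character of s is ASCII."""
    # str.isascii() is available on Python 3.7+; otherwise use the regex.
    if hasattr(s, 'isascii'):
        return s.isascii()
    return not has_non_ascii(s)


w = '_1991_اف_جي2'
print(has_non_ascii(w))        # True
print(is_ascii('?fd4))45s&'))  # True
```

Both variants scan the string in a single pass without building an intermediate bytes object. Like the answer above, they flag any non-ASCII character, so accented Latin letters count as "non-English" too.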