
How to remove or filter non-English (Chinese, Korean, Japanese, Arabic) strings in a list?


Here is an input example:

['ARTA Travel Group', 'Arta | آرتا', 'ARTAS™ Practice Development', 'ArtBinder', 'Arte Arac Takip App', 'アート建築', 'Arte Brasil Bar & Grill', 'ArtPod Stage', 'Artpollo扫码', 'Artpollo阿波罗-价值最优的艺术品投资电商', '아트홀']

As in the list above, I want to remove elements that contain Chinese, Korean, Japanese, or Arabic characters.

Below is the expected output (English only):

['ARTA Travel Group', 'ARTAS™ Practice Development', 'ArtBinder', 'Arte Arac Takip App', 'Arte Brasil Bar & Grill', 'ArtPod Stage']

Solution

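  • You can use a regex that searches by Unicode range. The character class below allows U+0000 to U+05C0 and rejects any string containing a character outside it, which catches Arabic (starting at U+0600) as well as the CJK and Hangul blocks. ™ belongs to Letterlike Symbols, which spans U+2100 to U+214F; you can either include the whole block or just pick the specific characters you need. Note that U+0000 to U+05C0 also admits Greek and Cyrillic, so narrow the range if you need strictly Latin text.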

    import re
    
    s = ['ARTA Travel Group', 'Arta | آرتا', 'ARTAS™ Practice Development', 'ArtBinder', 'Arte Arac Takip App', 'アート建築', 'Arte Brasil Bar & Grill', 'ArtPod Stage', 'Artpollo扫码', 'Artpollo阿波罗-价值最优的艺术品投资电商', '아트홀']
    
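    # Keep a string only if it contains no character outside the allowed ranges.
    # re.search stops at the first offending character, so no "+" or findall is needed.
    result = [i for i in s if not re.search("[^\u0000-\u05C0\u2100-\u214F]", i)]
    
    print(result)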
    
    ['ARTA Travel Group', 'ARTAS™ Practice Development', 'ArtBinder', 'Arte Arac Takip App', 'Arte Brasil Bar & Grill', 'ArtPod Stage']
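
If you want something stricter than the range above, here is a minimal sketch of the same idea wrapped in a reusable function. It assumes that "English-like" means Basic Latin through Latin Extended-B (U+0000 to U+024F) plus Letterlike Symbols; that cutoff, the name `keep_latin_only`, and the extra Cyrillic and Greek test strings are my own choices for illustration and are not part of the original question.

```python
import re

# Assumption: allowed characters are U+0000 to U+024F (Basic Latin through
# Latin Extended-B) plus Letterlike Symbols (U+2100 to U+214F, which has the
# trademark sign). Anything else marks the string as non-Latin.
NON_LATIN = re.compile(r"[^\u0000-\u024F\u2100-\u214F]")


def keep_latin_only(strings):
    """Return the strings that contain no character outside the allowed ranges."""
    return [s for s in strings if not NON_LATIN.search(s)]


# The last two entries (Cyrillic and Greek) are hypothetical additions to show
# the difference from the wider U+0000 to U+05C0 range.
names = ['ARTA Travel Group', 'Arta | آرتا', 'ARTAS™ Practice Development',
         'ArtBinder', 'アート建築', 'Artpollo扫码', '아트홀', 'Арта', 'Άρτα']

print(keep_latin_only(names))
```

With this narrower class the Cyrillic and Greek entries are dropped along with the Arabic, Chinese, Japanese, and Korean ones, while accented Latin letters such as é or ñ would still pass. Compiling the pattern once also avoids re-parsing it for every element of a long list.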