
Remove special characters such as smileys from a string, but keep German special characters


I know how to remove unwanted characters, such as smileys, from a string. However, some languages, like German, have special characters of their own that I want to keep.

This is my current code:

import unicodedata
string = "süß 😆😋😉"
uni_str = str(unicodedata.normalize('NFKD',
           string).encode('ascii','ignore'))

Is it possible to keep the German special characters but delete the other unwanted characters, such as the smileys 😆😋😉, so that uni_str holds the letters "süß" at the end?

Currently, the smileys are removed, but the German characters are either converted to other vowels or removed as well (for example, ü becomes u and ß is dropped).

The smileys are only an example; the unwanted characters could be anything.

I am using Python 3.6 on Windows 10.


Solution

  • You could do something simple like this (just add the German letters):

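    def filter_characters(value):
        # Allow-list of characters to keep; append the German letters here.
        allowed_characters = " 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
        # Keep only the characters that appear in the allow-list.
        return ''.join(c for c in value if c in allowed_characters)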
    

    Edit:

    Another possibility is to build allowed_characters with the help of the string module:

    import string
    allowed_characters = string.printable + 'äöüß'
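
Putting both ideas together, a minimal sketch might look like the following. The names GERMAN_LETTERS and ALLOWED_CHARACTERS are made up for this example, and including the uppercase umlauts is an assumption about what should be kept; adjust the set to your needs.

```python
import string

# Assumption: these are the German letters to keep; extend as needed.
GERMAN_LETTERS = "äöüÄÖÜß"

# Printable ASCII plus the German letters. A set makes each lookup cheap.
ALLOWED_CHARACTERS = set(string.printable + GERMAN_LETTERS)


def filter_characters(value):
    """Return value with every character outside the allow-list removed."""
    return "".join(c for c in value if c in ALLOWED_CHARACTERS)


cleaned = filter_characters("süß 😆😋😉")
print(cleaned.strip())  # süß
```

Note that the space before the smileys is kept because it is on the allow-list, hence the strip() call. Unlike the NFKD/ASCII approach, nothing is normalized here, so ü and ß pass through unchanged.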