python pandas python-module-unicodedata

More efficient way to replace special chars with their unicode name in pandas df


I have a large pandas dataframe and would like to clean its text thoroughly. To do this, I wrote the code below, which checks whether a character is an emoji, a digit, a Roman numeral, or a currency symbol, and replaces it with its Unicode name from the unicodedata module.

The code uses a double for loop, though, and I believe there must be far more efficient solutions, but I haven't yet figured out how to implement it in a vectorized manner.

My current code is as follows:

from unicodedata import category, name as unicodename

def clean_text(text):
    newtext = ""  # accumulate the cleaned string here
    for item in text:
        for char in item:
            # Simple space
            if char == ' ':
                newtext += char
            # Letters
            elif category(char)[0] == 'L':
                newtext += char
            # Other symbols: emojis
            elif category(char) == 'So':
                newtext += f" {unicodename(char)} "
            # Decimal numbers
            elif category(char) == 'Nd':
                newtext += f" {unicodename(char).replace('DIGIT ', '').lower()} "
            # Letterlike numbers e.g. Roman numerals
            elif category(char) == 'Nl':
                newtext += f" {unicodename(char)} "
            # Currency symbols
            elif category(char) == 'Sc':
                newtext += f" {unicodename(char).replace(' SIGN', '').lower()} "
            # Punctuation, invisibles (separator, control chars), maths symbols...
            else:
                newtext += " "
    return newtext

At the moment I am using this function on my dataframe with an apply:

df['Texts'] = df['Texts'].apply(clean_text)

Sample data:

import pandas as pd

l = [
    "thumbs ups should be replaced: 👍👍👍",
    "hearts also should be replaced:  ❤️️❤️️❤️️❤️️",
    "also other emojis: ☺️☺️",
    "numbers and digits should also go: 40/40",
    "Ⅰ, Ⅱ, Ⅲ these are roman numerals, change 'em"
]
df = pd.DataFrame(l, columns=['Texts'])

Solution

  • A good start would be to not do as much work:

    1. Once you've resolved the representation for a character, cache it (lru_cache() does that for you).
    2. Don't call category() and name() more often than you need to.

    from functools import lru_cache
    from unicodedata import name as unicodename, category
    
    
    @lru_cache(maxsize=None)
    def map_char(char: str) -> str:
        if char == " ":  # Simple space
            return char
    
        cat = category(char)
    
        if cat[0] == "L":  # Letters
            return char
    
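        # Pass a default: unnamed characters (e.g. control characters such as
        # "\n") would otherwise raise ValueError before reaching the fallback.
        name = unicodename(char, "")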
    
        if cat == "So":  # Other symbols: emojis
            return f" {name} "
        if cat == "Nd":  # Decimal numbers
            return f" {name.replace('DIGIT ', '').lower()} "
        if cat == "Nl":  # Letterlike numbers e.g. Roman numerals
            return f" {name} "
        if cat == "Sc":  # Currency symbols
            return f" {name.replace(' SIGN', '').lower()} "
        # Punctuation, invisibles (separator, control chars), maths symbols...
        return " "
    
    
    def clean_text(text: str) -> str:
        # With .apply(), text is a single cell (one string), so a single pass
        # over its characters is enough.
        return "".join(map_char(char) for char in text)
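
Going one step further than the cache above, you could probably let str.translate do the per-character loop for you. The sketch below is my own assumption rather than anything from the question: CharTable and TABLE are hypothetical names, and the table is a dict subclass whose __missing__ works out the replacement for a code point the first time it is seen and stores it for later lookups. I haven't benchmarked it against the lru_cache version, so treat it as something to try, not a guaranteed win.

```python
from unicodedata import category, name as unicodename


class CharTable(dict):
    """Hypothetical lazily-filled translation table for str.translate()."""

    def __missing__(self, codepoint: int) -> str:
        # str.translate() looks up integer code points, not characters.
        char = chr(codepoint)
        cat = category(char)
        if char == " " or cat[0] == "L":  # plain space and letters stay as-is
            result = char
        elif cat in ("So", "Nl"):  # emojis / other symbols, letterlike numbers
            result = f" {unicodename(char, '')} "
        elif cat == "Nd":  # decimal digits
            result = f" {unicodename(char, '').replace('DIGIT ', '').lower()} "
        elif cat == "Sc":  # currency symbols
            result = f" {unicodename(char, '').replace(' SIGN', '').lower()} "
        else:  # punctuation, invisibles, maths symbols...
            result = " "
        self[codepoint] = result  # store it so __missing__ runs once per code point
        return result


TABLE = CharTable()


def clean_text(text: str) -> str:
    # One call per string; the table grows as new characters are encountered.
    return text.translate(TABLE)
```

With the dataframe from the question this would still be applied as df['Texts'].apply(clean_text). The table only ever holds the distinct characters that actually occur in the data, so it stays small even for a large column.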