python pandas

Converting special text to Latin characters in Python


I have the following pandas data frame:

import pandas as pd

the_df = pd.DataFrame({'id': [1, 2], 'name': ['Joe', '𝒮𝒶𝓇𝒶𝒽']})
the_df
    id  name
0   1   Joe
1   2   𝒮𝒶𝓇𝒶𝒽

As you can see, the second name reads as "Sarah", but it is written with special (Unicode mathematical script) characters rather than plain letters.

I want to create a new column with these characters converted to Latin characters. I have tried this approach:

the_df['latin_name'] = the_df['name'].str.extract(r'(^[a-zA-Z\s]*)')
the_df
    id  name    latin_name
0   1   Joe     Joe
1   2   𝒮𝒶𝓇𝒶𝒽  

But the regex doesn't recognize the stylized letters, so latin_name comes out empty for that row. Any help with this would be greatly appreciated.


Solution

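  • Try .str.normalize with the 'NFKC' form before extracting. It replaces compatibility characters, such as these stylized letters, with their plain equivalents: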

    the_df['name'].str.normalize('NFKC').str.extract(r'(^[a-zA-Z\s]*)')
    

    Output:

           0
    0    Joe
    1  Sarah
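
To see why the normalization step helps, here is a minimal standard-library sketch of the same two steps (NFKC normalization, then the regex from the question) applied to plain strings. The helper name to_latin is hypothetical and only for illustration; it is not part of pandas.

```python
import re
import unicodedata


def to_latin(text):
    """Hypothetical helper: NFKC-normalize, then keep the leading
    run of ASCII letters and whitespace (same pattern as the question)."""
    # NFKC maps compatibility characters (e.g. mathematical script
    # letters) to their plain equivalents.
    normalized = unicodedata.normalize('NFKC', text)
    # The pattern can match an empty string, so match() never returns None.
    return re.match(r'[a-zA-Z\s]*', normalized).group(0)


names = ['Joe', '𝒮𝒶𝓇𝒶𝒽']
latin_names = [to_latin(n) for n in names]
```

The same function could be applied to the column with the_df['name'].map(to_latin). Note that NFKC does not strip accents, so a name such as "José" would still be cut off at the "é" by this pattern.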