I have the following pandas data frame:
import pandas as pd

the_df = pd.DataFrame({'id': [1, 2], 'name': ['Joe', '𝒮𝒶𝓇𝒶𝒽']})
the_df
   id   name
0   1    Joe
1   2  𝒮𝒶𝓇𝒶𝒽
As you can see, the second name reads as "Sarah", but it is written with stylized Unicode characters rather than plain Latin letters.
I want to create a new column with these characters converted to Latin characters. I tried this approach:
the_df['latin_name'] = the_df['name'].str.extract(r'(^[a-zA-Z\s]*)')
the_df
   id   name latin_name
0   1    Joe        Joe
1   2  𝒮𝒶𝓇𝒶𝒽
But the pattern doesn't match the stylized letters, so latin_name is empty for the second row. Any help would be greatly appreciated.
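Try .str.normalize with the 'NFKC' form, which maps the stylized letters to their plain Latin equivalents before the regex runs: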
the_df['name'].str.normalize('NFKC').str.extract(r'(^[a-zA-Z\s]*)')
Output:
       0
0    Joe
1  Sarah
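To see why the original regex fails and what the normalization step changes, here is a minimal sketch using only the standard library's unicodedata module, which the pandas documentation for .str.normalize points to. The helper name to_latin and the list names are made up for illustration and do not come from the question.

```python
import re
import unicodedata

# Sample values mirroring the 'name' column from the question.
names = ["Joe", "𝒮𝒶𝓇𝒶𝒽"]

# Same pattern as in the question: a leading run of ASCII letters or whitespace.
pattern = re.compile(r"^[a-zA-Z\s]*")


def to_latin(name):
    """Hypothetical helper: NFKC-normalize, then keep the leading ASCII letters."""
    # NFKC replaces compatibility characters (such as mathematical script
    # letters) with their plain equivalents where Unicode defines one.
    normalized = unicodedata.normalize("NFKC", name)
    return pattern.match(normalized).group(0)


for name in names:
    # Printing the Unicode name of the first character shows that the
    # stylized "S" is a different code point from the ASCII "S".
    print(unicodedata.name(name[0]), "->", to_latin(name))

latin_names = [to_latin(n) for n in names]
```

In pandas, the same idea is the .str.normalize('NFKC') call shown above. Passing expand=False to .str.extract should return a Series rather than a one-column DataFrame, which is convenient when assigning the result to the_df['latin_name'].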