python pandas unicode apply latin

Convert Latin letters to their corresponding English letters in pandas


I have a pandas dataframe that contains restaurant names. Some names include accented Latin letters, e.g. é in Café or â in Yauatcha Pâtisserie, and these come out garbled after loading. For example, Yauatcha Pâtisserie shows up as

Yauatcha PÃ\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x82Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â\x82Ã\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x82Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â¢tisserie

Different restaurant names contain different accented letters, and each one is garbled in its own way. Is there any way to get the original Latin letter back, or at least its plain English equivalent?

You can download the dataset here. I tried the unidecode library, but it does not seem to work. Here is what I have tried:

import pandas as pd
import unidecode
df = pd.read_csv(r"stod.csv", encoding='latin1')
df['name'].apply(unidecode.unidecode)

So is there any way to recover the Latin letters from this gibberish?

Yauatcha PÃ\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x82Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â\x82Ã\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x82Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â¢tisserie

Note: I tried all the suggestions in the answers to this question, and none of them worked for me.


Solution

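  • This is mojibake applied several times over: UTF-8 bytes were decoded as Latin-1, the result was encoded as UTF-8 again, and so on. It can be reverted (see the demoji(x) function in the following script). For the sake of completeness, the moji(x) function shows the mechanism that produces the mojibake: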

    def demoji(x):
        """Undo one mojibake layer; return x unchanged if that is not possible."""
        try:
            return x.encode('latin-1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            return x
    
    def moji(x):
        """Add one mojibake layer: UTF-8 bytes misread as Latin-1."""
        # Every byte value is valid Latin-1, so this decode cannot fail
        return x.encode('utf-8').decode('latin-1')
    
    xx = 'PÃ\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x82Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â\x82Ã\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x82Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â¢tisserie'
    zz = xx
    
    print("values xx and zz (initial): {}".format(repr(xx)))
    ii = 0  # number of mojibake layers removed so far
    while True:
        yy = demoji(xx)
        if yy == xx:  # nothing left to undo
            break
        xx = yy
        ii += 1
    
    print("values xx and yy after {} demoji(xx) iterations: {}".format(ii, repr(xx)))
    # Re-apply the same number of layers to confirm the round trip
    for i in range(ii):
        yy = moji(yy)
    
    print("values yy and zz after {}   moji(yy) iterations are equal: {}".format(ii, yy == zz))
    

    Result: .\SO\55721108.py

    values xx and zz (initial): 'PÃ\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x82Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â\x82Ã\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x82Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â¢tisserie'
    values xx and yy after 7 demoji(xx) iterations: 'Pâtisserie'
    values yy and zz after 7   moji(yy) iterations are equal: True
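
To use the same idea on a whole column, the repair loop can be wrapped in a function that keeps no global state. The following is a minimal sketch rather than part of the answer above: the names fix_mojibake, strip_accents and add_layers, and the max_rounds safety cap, are assumptions introduced for illustration. It assumes every layer is "UTF-8 bytes misread as Latin-1", as in the string above. strip_accents uses the standard unicodedata module to get the plain English letters the question also asks about.

```python
import unicodedata


def fix_mojibake(text, max_rounds=20):
    """Undo repeated 'UTF-8 bytes read as Latin-1' layers.

    max_rounds is an assumed safety cap, chosen only for this sketch.
    """
    for _ in range(max_rounds):
        try:
            repaired = text.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # text is not a misdecoded UTF-8 byte sequence any more
        if repaired == text:
            break  # pure ASCII: nothing left to undo
        text = repaired
    return text


def strip_accents(text):
    """Replace accented letters with their base letters, e.g. 'â' -> 'a'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def add_layers(text, n):
    """Build a test input by applying the mojibake step n times."""
    for _ in range(n):
        text = text.encode("utf-8").decode("latin-1")
    return text


broken = add_layers("Yauatcha Pâtisserie", 7)
fixed = fix_mojibake(broken)
print(fixed)                 # Yauatcha Pâtisserie
print(strip_accents(fixed))  # Yauatcha Patisserie
```

With the dataframe from the question this could be applied as df['name'].apply(fix_mojibake), assuming every value in the column is a string; missing values would need to be skipped first. A correctly stored name that happens to look like mojibake, for example a literal 'Ã©', would be altered as well, so it is worth spot-checking the result.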