python python-3.x pandas data-import

'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte


I am trying to read a CSV dataset into a DataFrame called df1, but the call fails:

import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";")

df1.head()

The code above produces a long traceback; this is the most relevant part:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

Solution

  • The data is indeed not encoded as UTF-8; everything is ASCII except for that single 0x92 byte:

    b'Korea, Dem. People\x92s Rep.'
    

    Decode it as Windows codepage 1252 instead, where 0x92 is a fancy quote, ’ (U+2019, right single quotation mark):

    df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
                      sep=";", encoding='cp1252')
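To see why the codec choice matters, here is a minimal sketch that decodes just the offending bytes shown above, with no pandas or network access involved. The variable names are mine and purely illustrative.

```python
# The raw bytes of the problematic country name, as quoted above.
raw = b"Korea, Dem. People\x92s Rep."

# UTF-8 rejects 0x92 as the first byte of a character.
try:
    raw.decode("utf-8")
    utf8_ok = True
    error_position = None
except UnicodeDecodeError as exc:
    utf8_ok = False
    error_position = exc.start  # index of the byte that could not be decoded

# cp1252 maps 0x92 to U+2019 (right single quotation mark).
decoded = raw.decode("cp1252")

print(utf8_ok, error_position)
print(decoded)
```

In this isolated string the bad byte sits at index 18, right after "People".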
    

    Demo:

    >>> import pandas as pd
    >>> df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
    ...                   sep=";", encoding='cp1252')
    >>> df1.head()
                       2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  \
    0     Afghanistan  55.1  55.5  55.9  56.2  56.6  57.0  57.4  57.8  58.2  58.6
    1         Albania  74.3  74.7  75.2  75.5  75.8  76.1  76.3  76.5  76.7  76.8
    2         Algeria  70.2  70.6  71.0  71.4  71.8  72.2  72.6  72.9  73.2  73.5
    3  American Samoa    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..
    4         Andorra    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..
    
       2010  2011  2012  2013  Unnamed: 15  2014  2015
    0  59.0  59.3  59.7  60.0          NaN  60.4  60.7
    1  77.0  77.2  77.4  77.6          NaN  77.8  78.0
    2  73.8  74.1  74.3  74.6          NaN  74.8  75.0
    3    ..    ..    ..    ..          NaN    ..    ..
    4    ..    ..    ..    ..          NaN    ..    ..
    

    Note, however, that pandas also seems to take the HTTP headers at face value and produces mojibake when you load the data from a URL. When I save the data to disk first and then load it with pd.read_csv(), it is decoded correctly, but loading from the URL produces re-coded data:

    >>> df1[' '][102]
    'Korea, Dem. Peopleâ€™s Rep.'
    >>> df1[' '][102].encode('cp1252').decode('utf8')
    'Korea, Dem. People’s Rep.'
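The round trip above works because this kind of mojibake is what you get when UTF-8 bytes are decoded as cp1252. A small sketch that reproduces and then reverses the damage on a plain string, independent of pandas (the variable names are illustrative):

```python
# The correctly decoded country name, with U+2019 as the apostrophe.
correct = "Korea, Dem. People\u2019s Rep."

# Encoding to UTF-8 and then decoding those bytes as cp1252
# turns the single quote character into three junk characters.
mojibake = correct.encode("utf-8").decode("cp1252")

# Reversing the two steps recovers the original text.
repaired = mojibake.encode("cp1252").decode("utf-8")

print(mojibake)
print(repaired)
```

This repair only works when every junk character has a cp1252 encoding, so it is a diagnostic trick rather than a general fix.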
    

    This is a known bug in pandas. You can work around it by opening the URL with urllib.request and passing the response object to pd.read_csv() instead:

    >>> import urllib.request
    >>> with urllib.request.urlopen("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv") as resp:
    ...     df1 = pd.read_csv(resp, sep=";", encoding='cp1252')
    ...
    >>> df1[' '][102]
    'Korea, Dem. People’s Rep.'
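Another way to keep the decoding step under your own control is to decode the downloaded bytes yourself and hand the parser text instead of bytes. The sketch below assumes a hypothetical helper, decode_to_text_buffer, and a made-up two-line sample standing in for the real file. It parses with the standard-library csv module so it runs without pandas or a network connection.

```python
import csv
import io


def decode_to_text_buffer(raw: bytes, encoding: str = "cp1252") -> io.StringIO:
    """Decode raw bytes explicitly and wrap the text in a file-like object."""
    return io.StringIO(raw.decode(encoding))


# Made-up sample in the same layout as the real file (semicolon-separated,
# blank first header, ".." for missing values); not the actual data.
sample = b" ;2000;2001\nKorea, Dem. People\x92s Rep.;..;..\n"

buffer = decode_to_text_buffer(sample)
rows = list(csv.reader(buffer, delimiter=";"))

print(rows[1][0])
```

With pandas, the same buffer can be passed straight to the reader, for example pd.read_csv(decode_to_text_buffer(resp.read()), sep=";"), where resp is the response object from urllib.request.urlopen(). No encoding argument is needed then, because the text is already decoded.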