pandas, encoding, dta

Encoding Error When Reading .dta Files with Chinese Characters


I am trying to read .dta files with pandas:

import pandas as pd
my_data = pd.read_stata('filename', encoding='utf-8')

The error message is:

ValueError: Unknown encoding. Only latin-1 and ascii supported.

Other encodings suited to Chinese characters, such as gb18030 or gb2312, didn't work either. If I remove the encoding parameter, the text in the DataFrame comes out as garbage values.


Solution

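  • Read the file with the default encoding, then re-encode each string and decode it with the encoding you expect. This works because latin-1 maps every byte to exactly one character, so encoding the garbled text back to latin-1 recovers the original bytes. Suppose the column with garbled text is column1: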

    import pandas as pd
    dta = pd.read_stata('filename.dta')
    print(dta['column1'][0].encode('latin-1').decode('gb18030'))
    

    The printed result shows the Chinese characters correctly. gb2312 also works here.
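
The snippet above fixes a single value. To repair whole columns, the same encode/decode round trip can be wrapped in a helper. The following is only a sketch: fix_text and fix_columns are hypothetical names that do not exist in pandas, and it assumes the garbled columns really hold gb18030 bytes that were decoded as latin-1.

```python
def fix_text(value, wrong="latin-1", right="gb18030"):
    """Undo a wrong decode: re-encode with `wrong`, then decode with `right`."""
    if not isinstance(value, str):
        return value  # leave numbers, None, NaN and dates untouched
    try:
        return value.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value  # the round trip failed, so keep the value as it was read


def fix_columns(df, columns, wrong="latin-1", right="gb18030"):
    """Return a copy of DataFrame `df` with the listed text columns repaired."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].map(lambda v: fix_text(v, wrong, right))
    return out


# Simulate the garbled read: gb18030 bytes wrongly decoded as latin-1.
original = "\u4e2d\u6587"  # the two characters of "Chinese" in Chinese
garbled = original.encode("gb18030").decode("latin-1")
assert fix_text(garbled) == original

# Hypothetical usage with the DataFrame from the answer above:
# dta = fix_columns(dta, ["column1"])
```

Values that cannot complete the round trip are returned unchanged, so plain ASCII text and missing values pass through safely. Listing the columns explicitly avoids touching text that was read correctly.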