pythoncsvpandasunicodeiso-8859-1

convert em-dash to hyphen in python


I'm converting csv files into python Dataframe. And in the original file, one of the column has characters em-dash. I want it replaced by hyphen "-".

Partial original file from csv:

 NoDemande     NoUsager     Sens    IdVehicule     NoConduteur     HeureDebutTrajet    HeureArriveeSurSite    HeureEffective'
42192001801   42192002715    —        157Véh       42192000153    ...
42192000003   42192002021    +        157Véh       42192000002    ...
42192001833   42192000485    —      324My3FVéh     42192000157    ...

My code:

#coding=latin-1
import pandas as pd
import glob

pd.set_option('expand_frame_repr', False)

path = r'D:\Python27\mypfe\data_test'
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None,header=0,sep=';',parse_dates=['HeureDebutTrajet','HeureArriveeSurSite','HeureEffective'],
                      dayfirst=True)
    df['Sens'].replace(u"\u2014","-",inplace=True,regex=True)
    list_.append(df)

And it doesn't work at all, every time it only convert them into ?, which looks like it:

42191001122  42191002244    ?            181Véh   42191000114  ...
42191001293  42191001203    ?         319M9pVéh   42191000125  ...
42191000700  42191000272    ?            183Véh   42191000072  ...

And because I have french characters in the file, I'm using latin-1 instead of utf-8. If I delete the first line and write like this:

df = pd.read_csv(file_,index_col=None,header=0,sep=';',encoding='windows-1252',parse_dates=['HeureDebutTrajet','HeureArriveeSurSite','HeureEffective'],
                          dayfirst=True)

The result will be:

42191001122  42191002244  â??           181Véh   42191000114   ...
42191001293  42191001203  â??        319M9pVéh   42191000125   ...
42191000700  42191000272  â??           183Véh   42191000072   ...

How can I make all the em-dash replaced by -?

I added the part about repr:

for line in open(file_):
    print repr(line)

And the result turns out:

'"42191002384";"42191000118";"\xe2\x80\x94";"";"42191000182";...
'"42191002464";"42191001671";"+";"";"42191000182";...
'"42191000045";"42191000176";"\xe2\x80\x94";"620M9pV\xc3\xa9h";"42191000003";...
'"42191001305";"42191000823";"\xe2\x80\x94";"310V7pV\xc3\xa9h";"42191000126";...

Solution

  • u'\u2014' (EM DASH) can not be encoded in latin1/iso-8859-1, so that value can not appear in a properly encoded latin1 file.

    Possibly the files are encoded as windows-1252 for which u'\u2014' can be encoded as '\x97'.

    Another problem is that the CSV file apparently uses whitespace as the column separator, but your code uses semicolons. You can specify whitespace as the separator using delim_whitespace=True:

    df = pd.read_csv(file_, delim_whitespace=True)
    

    You can also specify the file's encoding using the encoding parameter. read_csv() will convert the incoming data to unicode:

    df = pd.read_csv(file_, encoding='windows-1252', delim_whitespace=True)
    

    In Python 2 (I think that you're using that), if you do not specify the encoding, the data remains in the original encoding, and this is probably the reason that your replacements are not working.

    Once you have properly loaded the file, you can replace characters as you have been doing:

    df = pd.read_csv(file_, encoding='windows-1252', delim_whitespace=True)
    df['Sens'].replace(u'\u2014', '-', inplace=True)
    

    EDIT

    Following your update where you show the repr() output, your file would appear to be UTF8 encoded, not latin1, and not Windows-1252. Since you are using Python 2 you need to specify the encoding when loading the CSV file:

    df = pd.read_csv(file_, sep=';', encoding='utf8')
    df['Sens'].replace(u'\u2014', '-', inplace=True)
    

    Because you specified an encoding, read_csv() will convert the incoming data to unicode, so replace() should now work as shown above. It should be that easy.