I'm converting csv files into python Dataframe. And in the original file, one of the column has characters em-dash. I want it replaced by hyphen "-".
Partial original file from csv:
NoDemande NoUsager Sens IdVehicule NoConduteur HeureDebutTrajet HeureArriveeSurSite HeureEffective'
42192001801 42192002715 — 157Véh 42192000153 ...
42192000003 42192002021 + 157Véh 42192000002 ...
42192001833 42192000485 — 324My3FVéh 42192000157 ...
My code:
#coding=latin-1
import pandas as pd
import glob
pd.set_option('expand_frame_repr', False)
path = r'D:\Python27\mypfe\data_test'
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
df = pd.read_csv(file_,index_col=None,header=0,sep=';',parse_dates=['HeureDebutTrajet','HeureArriveeSurSite','HeureEffective'],
dayfirst=True)
df['Sens'].replace(u"\u2014","-",inplace=True,regex=True)
list_.append(df)
And it doesn't work at all, every time it only convert them into ?
, which looks like it:
42191001122 42191002244 ? 181Véh 42191000114 ...
42191001293 42191001203 ? 319M9pVéh 42191000125 ...
42191000700 42191000272 ? 183Véh 42191000072 ...
And because I have french characters in the file, I'm using latin-1
instead of utf-8
. If I delete the first line and write like this:
df = pd.read_csv(file_,index_col=None,header=0,sep=';',encoding='windows-1252',parse_dates=['HeureDebutTrajet','HeureArriveeSurSite','HeureEffective'],
dayfirst=True)
The result will be:
42191001122 42191002244 â?? 181Véh 42191000114 ...
42191001293 42191001203 â?? 319M9pVéh 42191000125 ...
42191000700 42191000272 â?? 183Véh 42191000072 ...
How can I make all the em-dash —
replaced by -
?
I added the part about repr
:
for line in open(file_):
print repr(line)
And the result turns out:
'"42191002384";"42191000118";"\xe2\x80\x94";"";"42191000182";...
'"42191002464";"42191001671";"+";"";"42191000182";...
'"42191000045";"42191000176";"\xe2\x80\x94";"620M9pV\xc3\xa9h";"42191000003";...
'"42191001305";"42191000823";"\xe2\x80\x94";"310V7pV\xc3\xa9h";"42191000126";...
u'\u2014'
(EM DASH) can not be encoded in latin1/iso-8859-1, so that value can not appear in a properly encoded latin1 file.
Possibly the files are encoded as windows-1252 for which u'\u2014'
can be encoded as '\x97'
.
Another problem is that the CSV file apparently uses whitespace as the column separator, but your code uses semicolons. You can specify whitespace as the separator using delim_whitespace=True
:
df = pd.read_csv(file_, delim_whitespace=True)
You can also specify the file's encoding using the encoding
parameter. read_csv()
will convert the incoming data to unicode:
df = pd.read_csv(file_, encoding='windows-1252', delim_whitespace=True)
In Python 2 (I think that you're using that), if you do not specify the encoding, the data remains in the original encoding, and this is probably the reason that your replacements are not working.
Once you have properly loaded the file, you can replace characters as you have been doing:
df = pd.read_csv(file_, encoding='windows-1252', delim_whitespace=True)
df['Sens'].replace(u'\u2014', '-', inplace=True)
EDIT
Following your update where you show the repr()
output, your file would appear to be UTF8 encoded, not latin1, and not Windows-1252. Since you are using Python 2 you need to specify the encoding when loading the CSV file:
df = pd.read_csv(file_, sep=';', encoding='utf8')
df['Sens'].replace(u'\u2014', '-', inplace=True)
Because you specified an encoding, read_csv()
will convert the incoming data to unicode, so replace()
should now work as shown above. It should be that easy.