I received a CSV file in which a comma (,) is used as the field separator but, unfortunately, also as the decimal mark (German notation).
As a result, some rows have a different number of columns. Strangely, Excel parses the file just fine. Is it possible to read such files in pandas as well? So far I only get errors like
Error tokenizing data. C error: Expected 97 fields in line 3, saw 98
Here is a minimal example:
pd.read_csv(os.path.expanduser('~/Downloads/foo.csv'), sep=',', decimal=',')
where ~/Downloads/foo.csv has the following content:
first, number, third
some, 1, other
foo, 1.5, bar
baz, 1,5, some
When I load the data in R, it still reads the file and only reports parsing failures:
See spec(...) for full column specifications.
Warnung: 1538 parsing failures.
row col expected actual
1 -- 93 columns 97 columns
2 -- 93 columns 98 columns
3 -- 93 columns 97 columns
4 -- 93 columns 102 columns
5 -- 93 columns 99 columns
Is there such a permissive mode in pandas?
First, make sure your file does not use a quote character that you need to declare to read_csv (for example via its quotechar parameter).
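If the decimal values turn out to be quoted, no clean-up is needed at all. A minimal sketch, assuming the file wraps such numbers in double quotes (the sample data below is made up for illustration and is not the file from the question):

```python
import io

import pandas as pd

# Hypothetical sample: the decimal commas are protected by double quotes.
data = 'first,number,third\nfoo,"1,5",bar\nbaz,"2,25",some\n'

# quotechar keeps "1,5" together as one field; decimal="," then parses it.
df = pd.read_csv(io.StringIO(data), sep=",", decimal=",", quotechar='"')
print(df)
```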
If your file is ill-formed, there is no deterministic algorithm that can decide whether a sequence of characters containing a comma is two fields or a single field holding a number with a decimal comma.
You will have to write a preprocessor that cleans up the ill-formed data with an ad hoc algorithm that approximates the reality of your file. That can get nasty, for example "I assume that digits followed by a comma followed by 3 digits are actually the same field", plus whatever other variations of such rules your data needs.
You may also hit cases where even that is not deterministic; then you have no choice but to go back to the data source and ask for another file format or for a data fix.
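The kind of preprocessor described above might look roughly like this. It is only a sketch of one possible heuristic, not a general solution: it assumes that two adjacent fields consisting only of digits are the two halves of one decimal number, and it gives up on any line where that repair is not unique. The function name repair_line and its parameters are hypothetical.

```python
import io
import re

import pandas as pd

# Assumption: a field made only of digits (optionally padded with spaces).
DIGITS = re.compile(r"\s*\d+\s*")


def repair_line(line, n_expected, sep=","):
    """Re-join split decimals with '.', or return None if not unique."""
    fields = line.split(sep)
    extra = len(fields) - n_expected
    if extra == 0:
        return line
    # Positions i where fields[i] and fields[i + 1] both look like integers.
    candidates = [
        i for i in range(len(fields) - 1)
        if DIGITS.fullmatch(fields[i]) and DIGITS.fullmatch(fields[i + 1])
    ]
    overlapping = any(b - a == 1 for a, b in zip(candidates, candidates[1:]))
    if extra < 0 or len(candidates) != extra or overlapping:
        return None  # ambiguous or otherwise broken: needs a human
    for i in reversed(candidates):
        fields[i:i + 2] = [fields[i].strip() + "." + fields[i + 1].strip()]
    return sep.join(fields)


# Sample content from the question.
raw = "first, number, third\nsome, 1, other\nfoo, 1.5, bar\nbaz, 1,5, some\n"
lines = raw.splitlines()
n_expected = len(lines[0].split(","))

repaired, rejected = [], []
for line in lines:
    fixed = repair_line(line, n_expected)
    if fixed is None:
        rejected.append(line)
    else:
        repaired.append(fixed)

# skipinitialspace drops the blank after each comma in the sample data.
df = pd.read_csv(io.StringIO("\n".join(repaired)), skipinitialspace=True)
print(df)
print("lines needing manual review:", rejected)
```

A line such as a,1,5,7,b with four expected columns is returned as None, because either 1,5 or 5,7 could be the split number; those are the lines to take back to the data source.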
To drop the malformed lines and load the remaining ones, these parameters from the documentation will help:
error_bad_lines : boolean, default True Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these “bad lines” will be dropped from the DataFrame that is returned. (Only valid with C parser)
warn_bad_lines : boolean, default True If error_bad_lines is False, and warn_bad_lines is True, a warning for each “bad line” will be output. (Only valid with C parser).
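Note that in pandas 1.3 and later these two flags are deprecated in favour of the single on_bad_lines parameter, and they were removed in pandas 2.0. A minimal sketch using the sample content from the question (without the blanks after the commas), assuming a recent pandas version:

```python
import io

import pandas as pd

raw = "first,number,third\nsome,1,other\nfoo,1.5,bar\nbaz,1,5,some\n"

# on_bad_lines="skip" silently drops rows with too many fields;
# on_bad_lines="warn" drops them too but emits a warning for each one.
df = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(df)
```

The row baz,1,5,some has four fields instead of three, so it is dropped rather than repaired.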