I would like to read in a csv-file. Sadly this is created by a buggy software: My header row contains multiple separators ";" and ",". Which wouldn't be a problem if my dataset wouldn't have a decimal ",".
I would like to automatically read in the files without changing the header by hand but I did not find something according to in the Pandas documentation.
The example data:
some file header row number 1
some file header row number two
some more information about data in that file
column1;column2;column3,column4,column5
1,234;2,665;0,888;3,891;3,762
2,232;1,233;0,888;3,789;3,524
Can I specify a different separator for the header than for the dataset itself? My code:
df = pd.read_csv(
"mypath\test.csv",
skiprows = 3;header= 0, delimiter=';,', decimal = ",")
which results in reading everything into one column
column1;column2;column3,column4,column5
0 1,234;2,665;0,888;3,891;3,762
1 2,232;1,233;0,888;3,789;3,524
But the result should be five columns:
column1 column2 column3 column4 column5
0 1.234 2.665 0.888 3.891 3.762
1 2.232 1.233 0.888 3.789 3.524
Assuming your headers start with a letter from [A-Za-z]
you can pass a regex with a positive lookahead for the comma-option to sep
inside pd.read_csv
:
import pandas as pd
from io import StringIO
csv = """some file header row number 1
some file header row number two
some more information about data in that file
column1;column2;column3,column4,column5
1,234;2,665;0,888;3,891;3,762
2,232;1,233;0,888;3,789;3,524"""
df = pd.read_csv(StringIO(csv),
sep=r';|,(?=[A-Za-z])',
skiprows=3,
engine='python',
decimal=',')
Output:
column1 column2 column3 column4 column5
0 1.234 2.665 0.888 3.891 3.762
1 2.232 1.233 0.888 3.789 3.524
Regex explanation. Or use: r';|,(?=[^\d])'
(not followed by a digit).
You can of course adjust the regex if you need a more strict solution to ensure that the option for the comma won't match for the subsequent rows.
Edit: to clarify, the equivalent for your file path is:
pd.read_csv("mypath\test.csv", sep=r';|,(?=[A-Za-z])', ...)
# replacing `StringIO(csv)` with `"mypath\test.csv"`
Not:
pd.read_csv(StringIO("mypath\test.csv"), sep=r';|,(?=[A-Za-z])', ...)
which generates the error mentioned in the comment:
EmptyDataError: No columns to parse from file