pythonpandasdataframeseparator

csv to dataframe with several delimiters and software bug


I would like to read in a csv-file. Sadly this is created by a buggy software: My header row contains multiple separators ";" and ",". Which wouldn't be a problem if my dataset wouldn't have a decimal ",".

I would like to automatically read in the files without changing the header by hand but I did not find something according to in the Pandas documentation.

The example data:

some file header row number 1
some file header row number two
some more information about data in that file
column1;column2;column3,column4,column5
1,234;2,665;0,888;3,891;3,762
2,232;1,233;0,888;3,789;3,524

Can I specify a different separator for the header than for the dataset itself? My code:

df = pd.read_csv(
    "mypath\test.csv", 
    skiprows = 3;header= 0, delimiter=';,', decimal = ",")

which results in reading everything into one column

    column1;column2;column3,column4,column5
0   1,234;2,665;0,888;3,891;3,762
1   2,232;1,233;0,888;3,789;3,524

But the result should be five columns:

    column1     column2     column3     column4     column5
0   1.234   2.665   0.888   3.891   3.762
1   2.232   1.233   0.888   3.789   3.524

Solution

  • Assuming your headers start with a letter from [A-Za-z] you can pass a regex with a positive lookahead for the comma-option to sep inside pd.read_csv:

    import pandas as pd
    from io import StringIO
    
    csv = """some file header row number 1
    some file header row number two
    some more information about data in that file
    column1;column2;column3,column4,column5
    1,234;2,665;0,888;3,891;3,762
    2,232;1,233;0,888;3,789;3,524"""
    
    df = pd.read_csv(StringIO(csv), 
                     sep=r';|,(?=[A-Za-z])', 
                     skiprows=3, 
                     engine='python', 
                     decimal=',')
    

    Output:

       column1  column2  column3  column4  column5
    0    1.234    2.665    0.888    3.891    3.762
    1    2.232    1.233    0.888    3.789    3.524
    

    Regex explanation. Or use: r';|,(?=[^\d])' (not followed by a digit).

    You can of course adjust the regex if you need a more strict solution to ensure that the option for the comma won't match for the subsequent rows.


    Edit: to clarify, the equivalent for your file path is:

    pd.read_csv("mypath\test.csv", sep=r';|,(?=[A-Za-z])', ...)
    
    # replacing `StringIO(csv)` with `"mypath\test.csv"`
    

    Not:

    pd.read_csv(StringIO("mypath\test.csv"), sep=r';|,(?=[A-Za-z])', ...)
    

    which generates the error mentioned in the comment:

    EmptyDataError: No columns to parse from file