python, pandas, csv, read-csv

importing weird string csv as float


import pandas as pd

U_d = pd.read_csv("Uzud.csv", decimal=',')
print(U_d)

what comes out:

     f in kHz  U in mV  amplitude in Vpp  distance in cm
43.7       52        1               5.0               9.0
43.7       43        5              11.0               NaN
43.7       30        5              13.0               NaN
43.7     27,5        5              15.0               NaN
43.7       24        5              16.0               NaN

(Note that the value 27.5 in 'U in mV' comes out as 27,5 and is interpreted as a string.)

All columns actually contain floats, but unfortunately the data is stored like this:

f in kHz,U in mV,amplitude in Vpp,distance in cm
"43,7",52,1,5,9
"43,7",43,5,11
"43,7",30,5,13
"43,7", "27,5",5,15
"43,7",24,5,16

which causes some data in 'U in mV' to be interpreted as a string. I want everything to be stored as a float.


Solution

  • Use pd.read_csv with skipinitialspace=True. Add dtype=float if integers should become floats as well. For correcting "bad lines", see below.

    import pandas as pd
    from io import StringIO
    
    s = """f in kHz,U in mV,amplitude in Vpp,distance in cm
    "43,7",52,1,5,9
    "43,7",43,5,11
    "43,7",30,5,13
    "43,7", "27,5",5,15
    "43,7",24,5,16
    """
    
    df = pd.read_csv(StringIO(s), decimal=',', skipinitialspace=True, dtype=float)
    

    Output

          f in kHz  U in mV  amplitude in Vpp  distance in cm
    43.7      52.0      1.0               5.0             9.0
    43.7      43.0      5.0              11.0             NaN
    43.7      30.0      5.0              13.0             NaN
    43.7      27.5      5.0              15.0             NaN
    43.7      24.0      5.0              16.0             NaN
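As an aside (a minimal sketch of my own, not part of the original answer), the reason skipinitialspace=True matters here is the space before the quoted "27,5" field: without it, the quote is not recognized at the start of the field, so the value survives as a literal string and float conversion cannot happen.

```python
import pandas as pd
from io import StringIO

# One of the problematic lines: note the space before "27,5".
line = 'f,U,A,d\n"43,7", "27,5",5,15\n'

# Without skipinitialspace, the leading space stops pandas from
# treating the quote as a quote character, so the field stays a string.
df = pd.read_csv(StringIO(line), decimal=',')
print(df['U'].dtype)   # object

# With skipinitialspace=True the quotes are honored and decimal=','
# converts the value to a float.
df = pd.read_csv(StringIO(line), decimal=',', skipinitialspace=True)
print(df['U'].dtype)   # float64
```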
    

    As noted in the answer by @AutumnKome, your first line appears to be flawed:

    "43,7",52,1,5,9`
    # should probably be 
    "43,7","52,1",5,9
    

    Since pandas 1.4.0 you can handle such cases by passing a callable to on_bad_lines.

    It's a bit annoying that the first error appears in the first non-header row, leading to the issue discussed in this SO post, with submitted ticket here.

    As a workaround, use header=None, which pushes the header down into the data as the first row, and correct it afterwards.

    def handle_bad_lines(row):
        new_row = row[:1] + [float('.'.join(row[1:3]))] + row[3:]
        # ['43.7', '52', '1', '5', '9'] ->
        # ['43.7', 52.1, '5', '9']
        return new_row
    
    df = pd.read_csv(StringIO(s), decimal=',', 
                     skipinitialspace=True, 
                     header=None,
                     on_bad_lines=handle_bad_lines, 
                     engine='python') # with `pd 2.2.0`, use `engine='pyarrow'`
    
    df.head(3)
    
              0        1                 2               3
    0  f in kHz  U in mV  amplitude in Vpp  distance in cm  # <- header pushed into row 0
    1      43.7     52.1                 5               9
    2      43.7       43                 5              11
    

    Then correct the header and index and adjust the dtype (using df.T, df.set_index, df.rename_axis, df.reset_index and df.astype):

    df = (
        df.T
        .set_index(0)
        .T
        .rename_axis(columns=None)
        .reset_index(drop=True)
        .astype(float)
    )
    

    Output (bad lines corrected, header restored)

       f in kHz  U in mV  amplitude in Vpp  distance in cm
    0      43.7     52.1               5.0             9.0
    1      43.7     43.0               5.0            11.0
    2      43.7     30.0               5.0            13.0
    3      43.7     27.5               5.0            15.0
    4      43.7     24.0               5.0            16.0
    

    If your actual data contains more complicated bad lines, it may help to first collect them and investigate how they can best be handled. E.g.:

    bad_rows = []
    
    def handle_bad_lines(row):
        bad_rows.append(row)  # mutating the module-level list; no `global` needed
        return None  # returning `None` skips the line
    
    # after executing `df = pd.read_csv(...)`
    print(bad_rows)
    # [['43.7', '52', '1', '5', '9']]
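As an alternative (my own sketch, not part of the original answer), if you'd rather avoid on_bad_lines, you can read every column as a string and clean up afterwards with str.replace and astype. Note this only handles the decimal commas; it cannot repair rows with a wrong field count, such as the flawed first data line discussed above.

```python
import pandas as pd
from io import StringIO

# Same data as above, minus the mis-quoted first data line.
s = """f in kHz,U in mV,amplitude in Vpp,distance in cm
"43,7",43,5,11
"43,7",30,5,13
"43,7", "27,5",5,15
"43,7",24,5,16
"""

# Read everything as strings, swap the decimal comma, then convert.
df = pd.read_csv(StringIO(s), dtype=str, skipinitialspace=True)
df = df.apply(lambda col: col.str.replace(',', '.', regex=False)).astype(float)
print(df.dtypes)  # float64 for every column
```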