pandasdataframecsviopython-3.11

Optimize reshaping dataframe in pandas


I have a csv data log in the following format that i'm reading into dataframe then taking multiple steps to reshape it making it look like the below desired output. i got this to work, however it is taking 15 mins complete csv file with 100K rows. i feel like there gotta be a more efficient way. thank you in advance for your help.

my current approach is:

Input File Format:

Index,Type,Signals,Timestamp,Data
614,type1,SignalName1,2023-10-13 03:41:31.757778999 EDT,"signalName=SignalName1, value1=0.00402894988656044, value2=true, value3=0, value4=false"
615,type2,SignalName2,2023-10-13 03:41:31.757779000 EDT,"signalName=SignalName2, value1=0.004700441379100084, value2=true, value3=0, value4=true"
616,type3,SignalName3,2023-10-13 03:41:31.757779001 EDT,"signalName=SignalName3, value1=32.0, value2=false, value3=4, value4=false"
617,type1,SignalName1,2023-10-13 03:41:31.757779002 EDT,"signalName=SignalName1, value1=0.0053719328716397285, value2=true, value3=0, value4=false"
618,type2,SignalName2,2023-10-13 03:41:31.757779003 EDT,"signalName=SignalName2, value1=0.005216674879193306, value2=false, value3=4, value4=false"
619,type3,SignalName3,2023-10-13 03:41:31.757779004 EDT,"signalName=SignalName3, value1=0.0053719328716397285, value2=true, value3=0, value4=true"
620,type1,SignalName1,2023-10-13 03:41:31.757779005 EDT,"signalName=SignalName1, value1=9.9627685546875, value2=false, value3=4, value4=false"
621,type2,SignalName2,2023-10-13 03:41:31.757779006 EDT,"signalName=SignalName2, value1=0.00402894988656044, value2=true, value3=0, value4=false"
622,type3,SignalName3,2023-10-13 03:41:31.757779007 EDT,"signalName=SignalName3, value1=93.6219711303711, value2=true, value3=0, value4=false"
623,type1,SignalName1,2023-10-13 03:41:31.757779008 EDT,"signalName=SignalName1, value1=0.0585092157125473, value2=true, value3=0, value4=false"
624,type2,SignalName2,2023-10-13 03:41:31.757779009 EDT,"signalName=SignalName2, value1=3.365873098373413, value2=true, value3=0, value4=true"
625,type3,SignalName3,2023-10-13 03:41:31.757779010 EDT,"signalName=SignalName3, value1=9.9627685546875, value2=true, value3=0, value4=false"
626,type1,SignalName1,2023-10-13 03:41:31.757779011 EDT,"signalName=SignalName1, value1=0.00402894988656044, value2=true, value3=0, value4=false"
627,type2,SignalName2,2023-10-13 03:41:31.757779012 EDT,"signalName=SignalName2, value1=0.0053719328716397285, value2=false, value3=3, value4=false"
628,type3,SignalName3,2023-10-13 03:41:31.757779013 EDT,"signalName=SignalName3, value1=0.00402894988656044, value2=true, value3=0, value4=false"
629,type1,SignalName1,2023-10-13 03:41:31.757779014 EDT,"signalName=SignalName1, value1=0.2685966491699219, value2=true, value3=0, value4=false"
630,type2,SignalName2,2023-10-13 03:41:31.757779015 EDT,"signalName=SignalName2, value1=0.006714915856719017, value2=true, value3=0, value4=true"
631,type3,SignalName3,2023-10-13 03:41:31.757779016 EDT,"signalName=SignalName3, value1=0.0033574579283595085, value2=true, value3=0, value4=false"
632,type1,SignalName1,2023-10-13 03:41:31.757779017 EDT,"signalName=SignalName1, value1=0.006714915856719017, value2=false, value3=5, value4=false"
633,type2,SignalName2,2023-10-13 03:41:31.757779018 EDT,"signalName=SignalName2, value1=0.0033574579283595085, value2=true, value3=0, value4=false"
634,type3,SignalName3,2023-10-13 03:41:31.757779019 EDT,"signalName=SignalName3, value1=0.0033574579283595085, value2=true, value3=0, value4=false"

Desired Output:

note for the "nan", i'm using df.fillna(method='ffill') & df.fillna(method='backfill') to populate which is also taking too much time to complete. enter image description here


Solution

  • After you have unpacked your 'Data' column into multiple columns, so you have a dataframe df with columns (e.g.)...

    Index   Type    Signals     Timestamp   Value1  Value2
    

    ... then you can get the reshaping you're looking for with

    dfp = df.pivot_table(index="Timestamp", columns="Signals", 
                         values=["Value1", "Value2"], aggfunc='first').reset_index()
    dfp.columns = [f"{x[0]}_{x[1]}" for x in dfp.columns]