pythondataframecsvioreshape

reshape dataframe based on event occurance and clearing


i want to see if there is an efficient way to reshape dataframe that is read from a tab separated csv file. the data consist of event codes and messages that are stacked up vertically with their timestamps. the data include a state column that specify whether an event occured (TRUE) or cleared (FALSE). i tried iterating through each row and update accordingly, but it is taking forever to complete.

every event code can occur and clear multiple times. every occur event (TRUE) is followed by clear event (FALSE) for every EventCode.

below example shows the format of the input file:

Timestamp                   State   Data    EventCode   EventMsg            Class
19-May-2023 16:10:09.301    FALSE   1       EventCode 1 EventCode 1 Message class 1
19-May-2023 16:10:09.300    FALSE   2       EventCode 2 EventCode 2 Message class 1
19-May-2023 16:10:09.299    TRUE    3       EventCode 1 EventCode 1 Message class 2
19-May-2023 16:10:09.298    FALSE   4       EventCode 4 EventCode 4 Message class 2
19-May-2023 16:10:09.297    FALSE   5       EventCode 3 EventCode 3 Message class 2
19-May-2023 16:10:09.296    TRUE    6       EventCode 2 EventCode 2 Message class 1
19-May-2023 16:10:09.295    TRUE    7       EventCode 4 EventCode 4 Message class 2
19-May-2023 16:10:09.294    TRUE    8       EventCode 3 EventCode 3 Message class 2
19-May-2023 16:10:09.293    FALSE   0       EventCode 1 EventCode 1 Message class 2
19-May-2023 16:10:09.292    TRUE    9       EventCode 1 EventCode 1 Message class 2

below shows desired final format:


OccurTimestamp              clearTimestamp          Data    EventCode   EventMsg            Class
19-05-2023 16:10:09.299     19-05-2023 16:10:09.301 3       EventCode 1 EventCode 1 Message class 1
19-05-2023 16:10:09.296     19-05-2023 16:10:09.300 6       EventCode 2 EventCode 2 Message class 1
19-05-2023 16:10:09.295     19-05-2023 16:10:09.298 7       EventCode 4 EventCode 3 Message class 2
19-05-2023 16:10:09.294     19-05-2023 16:10:09.297 8       EventCode 3 EventCode 4 Message class 2
19-05-2023 16:10:09.292     19-05-2023 16:10:09.293 9       EventCode 1 EventCode 1 Message class 1

Solution

  • Based on your example, I assume that each EventCode only exists exactly two times (once with State = True and once with state = False). Then this should work:

    # sort so that values with state = True are first
    # then group the values based on EventCode
    # then only get the first rows for each group (those with state = True)
    # then reset the index to get a normal dataframe back
    # and rename the Timestamp column to OccurTimestamp
    new_df = df.sort_values(by = 'State', ascending=False) \
               .groupby('EventCode', group_keys=False) \
               .first() \
               .reset_index() \
               .rename(columns={'Timestamp': 'OccurTimestamp'})
    
    # now we just need the corresponding clearTimetamps:
    clear_timestamps = df[df['State'] == False][['Timestamp', 'EventCode']].rename(columns = {'Timestamp': 'clearTimestamp'})
    
    # and merge both dataframes based on the EventCode
    final = pd.merge(new_df, clear_timestamps, on = 'EventCode')
    

    Edit: Based on your additional info, I think you still need to iterate through the events as your problem inherently requires to check following events for a clear event.

    You can still use

    grouped = df.groupby('EventCode')
    

    as each group then only consists of the individual EventCodes preserving the order of your original dataframe. This avoids iterating over all other events when trying to find a corresponding clear event.

    You can then iterate over each group via

    for name, group in grouped:
        ...