python, pandas, optimization, alarm

How can I iterate over these two very large alarm dataframes faster?


I have two pandas DataFrames: one holds alarm condition start events and the other holds alarm condition stop events. Each event has an Alarm ID and a Message Time, and the stop DataFrame also has a Termination Type. For each start event, I want to add the time of the next stop event for the same Alarm ID, along with that stop event's Termination Type, and calculate the Alarm Active Seconds between the start event and the stop event. Some start events may have no corresponding stop event, some may have a simultaneous stop event (in which case Alarm Active Seconds is zero), and more than one start event may share the same stop event.
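Here is a small mocked-up example (not my real data, which I can't share) of the inputs and the output I'm after:

import pandas as pd

# Mocked-up start and stop logs, illustrative only
start = pd.DataFrame({
    'Alarm ID': [101, 101, 202, 303],
    'Message Time': pd.to_datetime([
        '2023-01-01 00:00:00',   # stops 30 s later
        '2023-01-01 00:00:10',   # shares the same stop event as the row above
        '2023-01-01 00:01:00',   # stop is simultaneous, so 0 active seconds
        '2023-01-01 00:02:00',   # never stops, so 'undefined'
    ]),
})
stop = pd.DataFrame({
    'Alarm ID': [101, 202],
    'Message Time': pd.to_datetime(['2023-01-01 00:00:30', '2023-01-01 00:01:00']),
    'Termination Type': ['OPERATOR', 'AUTO'],
})

# Desired result: start gains Terminated Time, Termination Type and
# Alarm Active Seconds columns (30, 20, 0 and 'undefined' in this example).

My current function is the nested loop below.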

def alarmActiveTime(start, stop):
    i = 0
    while i < len(start):
        j = stop['Message Time'].searchsorted([start.iloc[i].loc['Message Time']], side='left')[0]
        match_found = False
        while (j < len(stop)) & (match_found == False):
            if ((stop.iloc[j].loc['Message Time'] >= start.iloc[i].loc['Message Time']) & (stop.iloc[j].loc['Alarm ID'] == start.iloc[i].loc['Alarm ID'])):
                start.at[i, 'Terminated Time'] = stop.iloc[j].loc['Message Time']
                start.at[i, 'Termination Type'] = stop.iloc[j].loc['Termination Type']
                start.at[i, 'Alarm Active Seconds'] = int((start.iloc[i].loc['Terminated Time'] - start.iloc[i].loc['Message Time']).total_seconds())
                match_found = True
                j = len(stop)
            else:
                j += 1
        if not match_found:
            start.at[i, 'Terminated Time'] = 'undefined'
            start.at[i, 'Alarm Active Seconds'] = 'undefined'
        if ((i % 1000 == 0) | (i == len(start)-1)):
            print(f'{i} OF {len(start)} COMPLETE.') 
            print("%s SECONDS ELAPSED." % round((time.time() - start_time), 0)) # https://stackoverflow.com/questions/1557571/
            print('')
        i += 1 
    return start

I have about 570,000 start events and 480,000 stop events, so it takes hours to run. I'm looking for ways to improve the run time of this loop.

First, I tried filtering the stop DataFrame by Alarm ID.

def alarmActiveTime(start, stop):
    print('CALCULATING ACTIVE TIME OF ALARM...')
    i = 0
    while i < len(start):
        thisStop = stop.loc[(stop['Alarm ID'] == start.iloc[i].loc['Alarm ID'])]
        j = thisStop['Message Time'].searchsorted([start.iloc[i].loc['Message Time']], side='left')[0]
        match_found = False # Flag variable to track if a match is found
        while (j < len(thisStop)) & (match_found == False):
            if ((thisStop.iloc[j].loc['Message Time'] >= start.iloc[i].loc['Message Time']) & (thisStop.iloc[j].loc['Alarm ID'] == start.iloc[i].loc['Alarm ID'])):
                start.at[i, 'Terminated Time'] = thisStop.iloc[j].loc['Message Time']
                start.at[i, 'Termination Type'] = thisStop.iloc[j].loc['Termination Type']
                start.at[i, 'Alarm Active Seconds'] = int((start.iloc[i].loc['Terminated Time'] - start.iloc[i].loc['Message Time']).total_seconds())
                match_found = True
                j = len(thisStop)
            else:
                j += 1
        if not match_found:
            start.at[i, 'Terminated Time'] = 'undefined'
            start.at[i, 'Alarm Active Seconds'] = 'undefined'
        if ((i % 1000 == 0) | (i == len(start)-1)):
            print(f'{i} OF {len(start)} COMPLETE.') 
            print("%s SECONDS ELAPSED." % round((time.time() - start_time), 0)) # https://stackoverflow.com/questions/1557571/
            print('')
        i += 1 
    print("%s HOURS ELAPSED." % round(((time.time() - start_time)/3600), 2))
    return start

I also tried filtering the stop DataFrame by both Alarm ID and Message Time, and then taking the first row if the filtered result was non-empty.

def alarmActiveTime(start, stop):
    print('CALCULATING ACTIVE TIME OF ALARM...')
    i = 0
    while i < len(start):
        thisStop = stop.loc[(stop['Alarm ID'] == start.iloc[i].loc['Alarm ID'])]
        thisStop = thisStop.loc[thisStop['Message Time'] >= start.iloc[i].loc['Message Time']]
        if (len(thisStop) > 0):
            start.at[i, 'Terminated Time'] = thisStop.iloc[0].loc['Message Time']
            start.at[i, 'Termination Type'] = thisStop.iloc[0].loc['Termination Type']
            start.at[i, 'Alarm Active Seconds'] = int((start.iloc[i].loc['Terminated Time'] - start.iloc[i].loc['Message Time']).total_seconds())
        else:
            start.at[i, 'Terminated Time'] = 'undefined'
            start.at[i, 'Alarm Active Seconds'] = 'undefined'
        if ((i % 1000 == 0) | (i == len(start)-1)):
            print(f'{i} OF {len(start)} COMPLETE.') 
            print("%s SECONDS ELAPSED." % round((time.time() - start_time), 0)) # https://stackoverflow.com/questions/1557571/
            print('')
        i += 1 
    return start

These methods were both faster than searching through an unfiltered stop log, but they were roughly as fast as each other when I timed the first 20,000 loops or so. I presume that filtering by Message Time could have a greater impact as I work through the list. Is there anything else I could do to run this kind of loop faster, or is it just going to take that long to do this many comparisons?

Edit: I have 14,274 unique Alarm ID values in the start DataFrame. When I tried to merge the DataFrames together, I got an array memory error. I can't post a screenshot of the alarm logs, but I will post another update once I've masked something I can share.


Solution

  • I have 14,274 unique Alarm ID values in the start DataFrame. When I tried to merge the DataFrames together, I got an array memory error, and I couldn't post a screenshot of the alarm logs because the data is confidential. However, I was able to find a solution that cut the run time of the function from ~7 hours to ~30 minutes. I don't expect to have to run this function very often, so this run time is acceptable.

    First I filtered both DataFrames by 'Alarm ID'. I tried merging the filtered DataFrames, but that didn't generate the output I was looking for. Instead, I used .apply() with a lambda to build a tuple of the two values I needed, then split the tuple into two new columns and concatenated the filtered DataFrame onto the output.

    def alarmActiveTime(start, stop):
        output = pd.DataFrame()
        listIDs = start['Alarm ID'].unique().tolist()
        for alarmID in listIDs:
            # Filter both logs down to a single Alarm ID; copy() so the new columns
            # are written to an independent frame rather than a view of start
            thisStart = start.loc[start['Alarm ID'] == alarmID].copy()
            thisStop = stop.loc[stop['Alarm ID'] == alarmID]

            def termination(time):
                # Returning a tuple gives both values in one pass instead of running
                # essentially the same lookup twice in two separate functions
                thisDF = thisStop[thisStop['Message Time'] >= time]
                if len(thisDF) > 0:
                    return (thisDF['Message Time'].iloc[0], thisDF['Termination Type'].iloc[0])
                return ('Undefined', 'Undefined')

            thisStart['Termination Tuple'] = thisStart.apply(lambda row: termination(row['Message Time']), axis=1)
            thisStart['Termination Time'] = thisStart['Termination Tuple'].apply(lambda t: t[0])
            thisStart['Termination Type'] = thisStart['Termination Tuple'].apply(lambda t: t[1])
            thisStart.drop('Termination Tuple', axis=1, inplace=True)
            output = pd.concat([output, thisStart])
        return output
    

    I'm sure this could be further optimized, but this run time is manageable for me. Thank you to everyone who gave feedback. Let me know if I'm missing something.
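    One avenue I haven't properly tested yet is pd.merge_asof, which can do this whole forward-in-time, per-Alarm-ID match in a single call as long as both DataFrames are sorted by 'Message Time'. A rough sketch of what I have in mind (untested on the real logs; note that starts with no later stop would come back as NaT/NaN rather than 'undefined'):

    def alarmActiveTimeAsof(start, stop):
        # merge_asof requires both frames to be sorted by the time key
        start = start.sort_values('Message Time')
        stop = stop.sort_values('Message Time')
        # For each start row, pick the first stop row with the same Alarm ID whose
        # Message Time is at or after the start time (exact matches count, so a
        # simultaneous stop gives 0 active seconds)
        merged = pd.merge_asof(
            start,
            stop[['Alarm ID', 'Message Time', 'Termination Type']].rename(
                columns={'Message Time': 'Terminated Time'}),
            left_on='Message Time',
            right_on='Terminated Time',
            by='Alarm ID',
            direction='forward',
        )
        # Starts with no later stop get NaT here instead of 'undefined'
        merged['Alarm Active Seconds'] = (
            merged['Terminated Time'] - merged['Message Time']).dt.total_seconds()
        return merged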