I have two pandas DataFrames. One is alarm condition starting events, and the other is alarm condition ending events. Each event has an Alarm ID and a Message Time, and the stop DataFrame has a value for Termination Type. For each event in start, I want to add the next time the same Alarm ID registers a stop and the Termination Type for that stop event. I also want to calculate the Alarm Active Seconds between the start event and the stop event. Some start events may not have a corresponding stop event, some start events may have a simultaneous stop event (in which case the Alarm Active Seconds is zero), and more than one start event may share the same stop event.
def alarmActiveTime(start, stop):
i = 0
while i < len(start):
j = stop['Message Time'].searchsorted([start.iloc[i].loc['Message Time']], side='left')[0]
match_found = False
while (j < len(stop)) & (match_found == False):
if ((stop.iloc[j].loc['Message Time'] >= start.iloc[i].loc['Message Time']) & (stop.iloc[j].loc['Alarm ID'] == start.iloc[i].loc['Alarm ID'])):
start.at[i, 'Terminated Time'] = stop.iloc[j].loc['Message Time']
start.at[i, 'Termination Type'] = stop.iloc[j].loc['Termination Type']
start.at[i, 'Alarm Active Seconds'] = int((start.iloc[i].loc['Terminated Time'] - start.iloc[i].loc['Message Time']).total_seconds())
match_found = True
j = len(stop)
else:
j += 1
if not match_found:
start.at[i, 'Terminated Time'] = 'undefined'
start.at[i, 'Alarm Active Seconds'] = 'undefined'
if ((i % 1000 == 0) | (i == len(start)-1)):
print(f'{i} OF {len(start)} COMPLETE.')
print("%s SECONDS ELAPSED." % round((time.time() - start_time), 0)) # https://stackoverflow.com/questions/1557571/
print('')
i += 1
return start
I have about 570,000 start events and 480,000 stop events, so it takes hours to run. I'm looking for ways to improve the run time of this loop.
First, I tried filtering the stop DataFrame by Alarm ID.
def alarmActiveTime(start, stop):
print('CALCULATING ACTIVE TIME OF ALARM...')
i = 0
while i < len(start):
thisStop = stop.loc[(stop['Alarm ID'] == start.iloc[i].loc['Alarm ID'])]
j = thisStop['Message Time'].searchsorted([start.iloc[i].loc['Message Time']], side='left')[0]
match_found = False # Flag variable to track if a match is found
while (j < len(thisStop)) & (match_found == False):
if ((thisStop.iloc[j].loc['Message Time'] >= start.iloc[i].loc['Message Time']) & (thisStop.iloc[j].loc['Alarm ID'] == start.iloc[i].loc['Alarm ID'])):
start.at[i, 'Terminated Time'] = thisStop.iloc[j].loc['Message Time']
start.at[i, 'Termination Type'] = thisStop.iloc[j].loc['Termination Type']
start.at[i, 'Alarm Active Seconds'] = int((start.iloc[i].loc['Terminated Time'] - start.iloc[i].loc['Message Time']).total_seconds())
match_found = True
j = len(thisStop)
else:
j += 1
if not match_found:
start.at[i, 'Terminated Time'] = 'undefined'
start.at[i, 'Alarm Active Seconds'] = 'undefined'
if ((i % 1000 == 0) | (i == len(start)-1)):
print(f'{i} OF {len(start)} COMPLETE.')
print("%s SECONDS ELAPSED." % round((time.time() - start_time), 0)) # https://stackoverflow.com/questions/1557571/
print('')
i += 1
print("%s HOURS ELAPSED." % round(((time.time() - start_time)/3600), 2))
return start
I also tried filtering the stop DataFrame by Alarm ID and Message Time, and then returning the first element if the result was larger than 0.
def alarmActiveTime(start, stop):
print('CALCULATING ACTIVE TIME OF ALARM...')
i = 0
while i < len(start):
thisStop = stop.loc[(stop['Alarm ID'] == start.iloc[i].loc['Alarm ID'])]
thisStop = thisStop.loc[thisStop['Message Time'] >= start.iloc[i].loc['Message Time']]
if (len(thisStop) > 0):
start.at[i, 'Terminated Time'] = thisStop.iloc[0].loc['Message Time']
start.at[i, 'Termination Type'] = thisStop.iloc[0].loc['Termination Type']
start.at[i, 'Alarm Active Seconds'] = int((start.iloc[i].loc['Terminated Time'] - start.iloc[i].loc['Message Time']).total_seconds())
else:
start.at[i, 'Terminated Time'] = 'undefined'
start.at[i, 'Alarm Active Seconds'] = 'undefined'
if ((i % 1000 == 0) | (i == len(start)-1)):
print(f'{i} OF {len(start)} COMPLETE.')
print("%s SECONDS ELAPSED." % round((time.time() - start_time), 0)) # https://stackoverflow.com/questions/1557571/
print('')
i += 1
return start
These methods were both faster than searching through an unfiltered stop log, but they were roughly as fast as each other when I timed the first 20,000 loops or so. I presume that filtering by Message Time could have a greater impact as I work through the list. Is there anything else I could do to run this kind of loop faster, or is it just going to take that long to do this many comparisons?
Edit: I have 14274 unique alarm ID values in the start dataframe. When I tried to merge the dataframes together, I got an array memory error. I cannot post a screenshot of the alarm logs, but will post another update after I've got something masked that I can post.
I have 14274 unique alarm ID values in the start dataframe. When I tried to merge the dataframes together, I got an array memory error. I couldn't post a screenshot of the alarm logs because the data is confidential. However, I was able to find a solution that cut the run time of the function from ~7 hours to ~30 minutes. I don't expect I will have to run this function very often, so this run time is acceptable.
First I filtered the dataframe based on 'Alarm ID'. I tried merging the filtered dataframes, but that didn't generate the output I was looking for. I used apply.lambda()
to generate a tuple of the variables I was looking for. Then I defined two new columns based on the tuple and concatenated the filtered dataframe to the output.
def alarmActiveTime(start, stop):
output = pd.DataFrame()
listIDs = start['Alarm ID'].unique().tolist()
for x in listIDs:
def termination(time): #This was broken out into two different functions at first, but by returning a tuple we can get both important values at once without having to run essentially the same loop twice
thisDF = thisStop[(thisStop['Message Time']>= time)]
timeList = thisDF['Message Time'].tolist()
typeList = thisDF['Termination Type'].tolist()
if len(timeList) > 0:
timeOutput = timeList[0]
typeOutput = typeList[0]
else:
timeOutput = 'Undefined'
typeOutput = 'Undefined'
termTuple = (timeOutput, typeOutput)
return termTuple
thisStart = start.loc[start['Alarm ID'] == x]
thisStop = stop.loc[stop['Alarm ID'] == x]
thisStart['Termination Tuple'] = thisStart.apply(lambda x: termination(x['Message Time']), axis=1)
thisStart['Termination Time'] = thisStart['Termination Tuple'].apply(lambda x: x[0])
thisStart['Termination Type'] = thisStart['Termination Tuple'].apply(lambda x: x[1])
thisStart.drop('Termination Tuple', axis=1, inplace=True)
output = pd.concat([output, thisStart])
return output
I'm sure this could be further optimized, but this seemed like it was manageable. Thank you everyone who gave feedback. Let me know if I'm missing something.