I'm trying to extract substrings containing equipment names from the cells in a dataframe. Because of the way the data was created, these substrings can be in any cell. I created this program which uses "findall" and some regex to create a list of all the equipment found in the cells in a given row.
The problem is, the output isn't exactly as I need it. For example, if "findall" matches only one substring in the cell, my script does not add a delimiter afterwards. When the program continues to the next column, it joins the first column match with the second column matches, without a delimiter between the results. And I need the delimiter so I can explode the list later on.
Here is the code:
import pandas as pd
# IMPORT FILE AND CREATE DATAFRAME
d = {'Cause':['Consider checking XXX-1000 for deficiencies prior to train switch', 'XXX-2000 AND PPP-2200 to be taken out of service', 'Need to check XXX-3000 and potentially XXX-1000 for degradation'], 'Mitigation':['ZZZ-9999 is dependent on ZZZ-8000', 'These equipment will be out of service in 2025, not applicable', 'No further comments']}
df = pd.DataFrame(data=d)
# Trying the findall technique
df['equipment'] = ""
for column in df.columns:
    df['equipment'] = df['equipment'] + df[column].str.findall(r'\s*(\w{3,}-\d{4}\D*?) ').str.join('|')
    if df['equipment'].str.contains('|') == False:
        df['equipment'] += '|'
My output looks like this:
0 XXX-1000ZZZ-9999|ZZZ-8000
1 XXX-2000|PPP-2200
2 XXX-3000|XXX-1000
But I want it to look like this:
0 XXX-1000|ZZZ-9999|ZZZ-8000
1 XXX-2000|PPP-2200
2 XXX-3000|XXX-1000
So I added the last two lines of the code above to try to add the pipe character. It doesn't work and gives the following error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I know this is because the program expects a boolean value but I can't figure out how to fix it.
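On the error itself: `df['equipment'].str.contains('|') == False` evaluates to a whole boolean Series (one value per row), and a Python `if` cannot reduce that to a single True/False, hence the ValueError. Note also that `str.contains` treats its pattern as a regular expression by default, so a bare `'|'` matches every string; `regex=False` would be needed to test for a literal pipe. Below is a minimal sketch of one way to keep the per-column loop while replacing the `if` with a boolean mask. The names `pattern`, `need_sep` and `sep`, and the simplified regex, are illustrative assumptions rather than part of the original code.

```python
import pandas as pd

d = {'Cause': ['Consider checking XXX-1000 for deficiencies prior to train switch',
               'XXX-2000 AND PPP-2200 to be taken out of service',
               'Need to check XXX-3000 and potentially XXX-1000 for degradation'],
     'Mitigation': ['ZZZ-9999 is dependent on ZZZ-8000',
                    'These equipment will be out of service in 2025, not applicable',
                    'No further comments']}
df = pd.DataFrame(data=d)

# Simplified pattern (an assumption): three or more word characters, a hyphen, four digits.
pattern = r'\w{3,}-\d{4}'

# Start from an empty string per row and append each column's matches.
equipment = pd.Series('', index=df.index)
for column in ['Cause', 'Mitigation']:
    # Matches within one cell are joined with '|'; a cell with no match gives ''.
    part = df[column].str.findall(pattern).str.join('|')
    # Row-wise mask: a delimiter is needed only where both sides are non-empty.
    need_sep = (equipment != '') & (part != '')
    sep = need_sep.map({True: '|', False: ''})
    equipment = equipment + sep + part

df['equipment'] = equipment
print(df['equipment'].tolist())
```

The mask is computed for every row at once, so no row-level `if` is needed, and rows where a column has no match do not pick up a stray delimiter.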
I suggest this solution:
import pandas as pd
# IMPORT FILE AND CREATE DATAFRAME
d = {'Cause':['Consider checking XXX-1000 for deficiencies prior to train switch', 'XXX-2000 AND PPP-2200 to be taken out of service', 'Need to check XXX-3000 and potentially XXX-1000 for degradation'], 'Mitigation':['ZZZ-9999 is dependent on ZZZ-8000', 'These equipment will be out of service in 2025, not applicable', 'No further comments']}
df = pd.DataFrame(data=d)
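# Concatenate both text columns, find every equipment tag, and join the matches with '|'
df['equipment'] = (df['Cause'] + ' ' + df['Mitigation']).str.findall(r'(\w{3,}-\d{4})').apply(lambda x: '|'.join(x))
# Defensive cleanup: strip a trailing '|' if one is ever present (join itself does not add one)
df['equipment'] = df['equipment'].apply(lambda x: x.rstrip('|') if x.endswith('|') else x)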
for i in df['equipment']:
    print(i)
which returns:
XXX-1000|ZZZ-9999|ZZZ-8000
XXX-2000|PPP-2200
XXX-3000|XXX-1000
or simply
df['equipment']
giving
0 XXX-1000|ZZZ-9999|ZZZ-8000
1 XXX-2000|PPP-2200
2 XXX-3000|XXX-1000
Name: equipment, dtype: object
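Since the stated goal is to explode the list later, here is a hedged sketch of that follow-up step, assuming the `equipment` column holds the pipe-delimited strings shown above. The name `exploded` is an illustrative assumption; the values are copied from the output above.

```python
import pandas as pd

# Pipe-delimited strings as produced above.
df = pd.DataFrame({'equipment': ['XXX-1000|ZZZ-9999|ZZZ-8000',
                                 'XXX-2000|PPP-2200',
                                 'XXX-3000|XXX-1000']})

# Split each string on the pipe into a list, then give every tag its own row.
exploded = (
    df.assign(equipment=df['equipment'].str.split('|'))
      .explode('equipment')
      .reset_index(drop=True)
)
print(exploded)
```

If the joined string is not needed for anything else, the lists returned by `str.findall` can be exploded directly, which skips the join and split round trip.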