I have a Pandas DataFrame which contains information about various jobs. I am working on filtering based on values in some lists.
I have no problem with single value conditional filtering. However, I am having difficulties doing conditional filtering on the Job Description
field, which is essentially a paragraph and multiple lines, and the Job Skills
field which is essentially a list after I split on the \n\n
.
EXAMPLE DATA:
dftest=pd.DataFrame({
'Job Posting':['Data Scientist', 'Cloud Engineer', 'Systems Engineer', 'Data Engineer'],
'Time Type':['Full Time', 'Part Time', 'Full Time', 'Part Time'],
'Job Location': ['Colorado', 'Maryland', 'Florida', 'Virginia'],
'Job Description': [
'asdfas fasdfsad sadfsdaf sdfsdaf',
'asdfasd fasdfasd fwertqqw rtwergd fverty',
'qwerq e5r45yb rtfgs dfaesgf reasdfs dafads',
'aweert scdfsdf asdfa sdfsds vwerewr'],
'Job Skills': [
'Algorithms\n\nData Analysis\n\nData Mining\n\nData Modeling\n\nData Science\n\nExploratory Data Analysis (EDA)\n\nMachine Learning\n\nUnstructured Data',
'Application Development\n\nApplication Integrations\n\nArchitectural Modeling\n\nCloud Computing\n\nSoftware Product Design\n\nTechnical Troubleshooting',
'Configuration Management (CM)\n\nInformation Management\n\nIntegration Testing\n\nRequirements Analysis\n\nRisk Management\n\nVerification and Validation (V&V)',
'Big Data Analytics\n\nBig Data Management\n\nDatabase Management\n\nData Mining\n\nData Movement\n\nETL Processing\n\nMetadata Repository']
})
Job Posting | Time Type | Job Location | Job Description | Job Skills | |
---|---|---|---|---|---|
0 | Data Scientist | Full Time | Maryland | asdfas fasdfsad sadfsdaf sdfsdaf | Algorithms\n\nData Analysis\n\nPython\n\n Data... |
1 | Cloud Engineer | Part Time | Maryland | asdfasd fasdfasd fwertqqw rtwergd fverty | Application Development\n\nApplication Integra... |
2 | Systems Engineer | Full Time | Virginia | qwerq e5r45yb rtfgs dfaesgf reasdfs dafads | Configuration Management (CM)\n\nInformation M... |
3 | Data Engineer | Part Time | Virginia | aweert scdfsdf asdfa sdfsds vwerewr | Big Data Analytics\n\nBig Data Management\n\nP... |
LISTS and SPLITTING OF 'Job Skills' data by '\n\n':
state = ['Virginia', 'Maryland', 'District of Columbia']
time = ['Full Time']
skills = ['AI', 'Artificial Intelligence', 'Deep Learning', 'Machine Learning',
'Feature Selection', 'Feature Selection', 'Python', 'Cloud Computing']
dftest['Job Skills'] = dftest['Job Skills'].str.split('\n\n')
Results:
[Algorithms, Data Analysis, Data Mining, Data Modeling, Data Science, Exploratory Data Analysis (EDA), Machine Learning, Unstructured Data]
[Application Development, Application Integrations, Architectural Modeling, Cloud Computing, Software Product Design, Technical Troubleshooting]
[Configuration Management (CM), Information Management, Integration Testing, Requirements Analysis, Risk Management, Verification and Validation (V&V)]
[Big Data Analytics, Big Data Management, Database Management, Data Mining, Data Movement, ETL Processing, Metadata Repository]
CONDITIONAL FILTERING:
dftest[dftest['Job Location'].isin(state) & dftest['Time Type'].isin(time)]
Results:
Job Posting | Time Type | Job Location | Job Description | Job Skills | |
---|---|---|---|---|---|
0 | Data Scientist | Full Time | Maryland | asdfas fasdfsad sadfsdaf sdfsdaf | [Algorithms, Data Analysis, Python, Data Mini... |
2 | Systems Engineer | Full Time | Virginia | qwerq e5r45yb rtfgs dfaesgf reasdfs dafads | Configuration Management (CM), Information Ma... |
ISSUE:
Now I want to take the values in dftest['Job Skills']
and find all the rows that match the skills
list.
I've tried, among others:
dftest['Job Skills'].filter(like=skills, axis=0)
, but that gives another error.I think I am almost there with this, but I just want to have a single unique row if there is a match. For example, this shows rows 0 and 3 match, so I want those rows to print.
for i in skills:
print('skill: ',i)
print(dftest['Job Skills'].map(set([i]).issubset))
IIUC, you can use pd.Series.apply()
+ any()
:
out = dftest[dftest["Job Skills"].apply(lambda x: any(s in skills for s in x))]
print(out)
Prints:
Job Posting Time Type Job Location Job Description Job Skills
0 Data Scientist Full Time Colorado asdfas fasdfsad sadfsdaf sdfsdaf [Algorithms, Data Analysis, Data Mining, Data Modeling, Data Science, Exploratory Data Analysis (EDA), Machine Learning, Unstructured Data]
1 Cloud Engineer Part Time Maryland asdfasd fasdfasd fwertqqw rtwergd fverty [Application Development, Application Integrations, Architectural Modeling, Cloud Computing, Software Product Design, Technical Troubleshooting]