pythonpandasdataframefilter

Find rows where pandas dataframe column, which is a paragraph or list, contains any value in another list


I have a Pandas DataFrame which contains information about various jobs. I am working on filtering based on values in some lists.

I have no problem with single value conditional filtering. However, I am having difficulties doing conditional filtering on the Job Description field, which is essentially a paragraph and multiple lines, and the Job Skills field which is essentially a list after I split on the \n\n.

EXAMPLE DATA:

    dftest=pd.DataFrame({
        'Job Posting':['Data Scientist', 'Cloud Engineer', 'Systems Engineer', 'Data Engineer'],
        'Time Type':['Full Time', 'Part Time', 'Full Time', 'Part Time'], 
        'Job Location': ['Colorado', 'Maryland', 'Florida', 'Virginia'], 
        'Job Description': [
            'asdfas fasdfsad sadfsdaf sdfsdaf',
            'asdfasd fasdfasd fwertqqw rtwergd fverty',
            'qwerq e5r45yb rtfgs dfaesgf reasdfs dafads',
            'aweert scdfsdf asdfa sdfsds vwerewr'],
        'Job Skills': [
            'Algorithms\n\nData Analysis\n\nData Mining\n\nData Modeling\n\nData Science\n\nExploratory Data Analysis (EDA)\n\nMachine Learning\n\nUnstructured Data',
            'Application Development\n\nApplication Integrations\n\nArchitectural Modeling\n\nCloud Computing\n\nSoftware Product Design\n\nTechnical Troubleshooting',
            'Configuration Management (CM)\n\nInformation Management\n\nIntegration Testing\n\nRequirements Analysis\n\nRisk Management\n\nVerification and Validation (V&V)',
            'Big Data Analytics\n\nBig Data Management\n\nDatabase Management\n\nData Mining\n\nData Movement\n\nETL Processing\n\nMetadata Repository']
    })
Job Posting Time Type Job Location Job Description Job Skills
0 Data Scientist Full Time Maryland asdfas fasdfsad sadfsdaf sdfsdaf Algorithms\n\nData Analysis\n\nPython\n\n Data...
1 Cloud Engineer Part Time Maryland asdfasd fasdfasd fwertqqw rtwergd fverty Application Development\n\nApplication Integra...
2 Systems Engineer Full Time Virginia qwerq e5r45yb rtfgs dfaesgf reasdfs dafads Configuration Management (CM)\n\nInformation M...
3 Data Engineer Part Time Virginia aweert scdfsdf asdfa sdfsds vwerewr Big Data Analytics\n\nBig Data Management\n\nP...

LISTS and SPLITTING OF 'Job Skills' data by '\n\n':

    state = ['Virginia', 'Maryland', 'District of Columbia']
    time = ['Full Time']
    skills = ['AI', 'Artificial Intelligence', 'Deep Learning', 'Machine Learning', 
        'Feature Selection', 'Feature Selection', 'Python', 'Cloud Computing'] 
    dftest['Job Skills'] = dftest['Job Skills'].str.split('\n\n')

Results:

[Algorithms, Data Analysis, Data Mining, Data Modeling, Data Science, Exploratory Data Analysis (EDA), Machine Learning, Unstructured Data]

[Application Development, Application Integrations, Architectural Modeling, Cloud Computing, Software Product Design, Technical Troubleshooting]

[Configuration Management (CM), Information Management, Integration Testing, Requirements Analysis, Risk Management, Verification and Validation (V&V)]

[Big Data Analytics, Big Data Management, Database Management, Data Mining, Data Movement, ETL Processing, Metadata Repository]

CONDITIONAL FILTERING:

    dftest[dftest['Job Location'].isin(state) & dftest['Time Type'].isin(time)]

Results:

Job Posting Time Type Job Location Job Description Job Skills
0 Data Scientist Full Time Maryland asdfas fasdfsad sadfsdaf sdfsdaf [Algorithms, Data Analysis, Python, Data Mini...
2 Systems Engineer Full Time Virginia qwerq e5r45yb rtfgs dfaesgf reasdfs dafads Configuration Management (CM), Information Ma...

ISSUE: Now I want to take the values in dftest['Job Skills'] and find all the rows that match the skills list.

I've tried, among others:

I think I am almost there with this, but I just want to have a single unique row if there is a match. For example, this shows rows 0 and 3 match, so I want those rows to print.

    for i in skills:
        print('skill: ',i)
        print(dftest['Job Skills'].map(set([i]).issubset))

Solution

  • IIUC, you can use pd.Series.apply() + any():

    out = dftest[dftest["Job Skills"].apply(lambda x: any(s in skills for s in x))]
    print(out)
    

    Prints:

          Job Posting  Time Type Job Location                           Job Description                                                                                                                                        Job Skills
    0  Data Scientist  Full Time     Colorado          asdfas fasdfsad sadfsdaf sdfsdaf       [Algorithms, Data Analysis, Data Mining, Data Modeling, Data Science, Exploratory Data Analysis (EDA), Machine Learning, Unstructured Data]
    1  Cloud Engineer  Part Time     Maryland  asdfasd fasdfasd fwertqqw rtwergd fverty  [Application Development, Application Integrations, Architectural Modeling, Cloud Computing, Software Product Design, Technical Troubleshooting]