pythonpandasdataframelistfindall

Check if Python list elements are in a Pandas dataframe row and append each unique occurrence of the list values to a new column


My question is similar to this question raised However, The solution provided appears to work where only one element of the list will exist in the dataframe column.

My issue is that multiple values from the list might exist in each row of the df and the list values might be referenced more than once.

I have a dataframe which includes a number of job listings. The Job description column contains the full job description. To extract the key skill requirements from the job description column a number of additional columns were created. eg. Programming_Languages,Data visualisation etc

Each column has a list of values to check for there occurrence in the job description eg.

pl_list=['Python','RStudio','C#','HTML5']
dv_list=['Power BI,Tableau']

My current solution:

df2['Programming_Languages'] = df2['Programming_Languages'].str.findall('|'.join(pl_list),re.IGNORECASE)
    df2['Data_Visualisation'] = df2['Data_Visualisation'].str.findall('|'.join(dv_list),re.IGNORECASE)

The job description might reference the values in a list multiple times. so if python and Rstudio are referenced more than once it will return each occurrence. I would just like it to return the value once.

Code to recreate df

df=[['ABC','ABC','Python, C#, Power BI Tableau python Power BI'],['ABC','ABC','Power BI Python'],['ABC','ABC','Rstudio Rstudio Rstudio Tableau'],['ABC','ABC','C# C# C# HTML5']]
df = pd.DataFrame(df, columns=['Companyname', 'Jobtitle','Job_description'])

df2 = df.reindex(df.columns.tolist() + ['Programming_Languages','Data_Visualisation'],axis=True)

My current result:

My current result

My desired result:

My desired result


Solution

  • Here's a way to do it if you don't mind the case of the elements in the new columns:

    skills = {
        'Programming_Languages': ['Python','C#','Rstudio','HTML5'],
        'Data_visualization': ['Power BI','Tableau']
    }
    df[list(skills)] = pd.DataFrame([
        df['Job_description'].str.findall('|'.join(skills[skill]),re.IGNORECASE)
            .apply(lambda x: pd.Series(x).str.lower().unique() if len(x) else [])
        for skill in skills], index=list(skills)).T
    

    Output:

      Companyname Jobtitle                                    Job_description Programming_Languages   Data_visualization
    0         ABC      ABC  Python,C#,Power BI,Tableau,python,Power BI,Pow...          [python, c#]  [power bi, tableau]
    1         ABC      ABC                                    Power BI,Python              [python]           [power bi]
    2         ABC      ABC            Rstudio,Rstudio,Rstudio,Tableau,Tableau             [rstudio]            [tableau]
    3         ABC      ABC                                     C#,C#,C#,HTML5           [c#, html5]                   []
    

    Explanation:


    If you want the results to contain the search terms with their original case, you can do this:

    skills = {
        'Programming_Languages': ['Python','C#','Rstudio','HTML5'],
        'Data_visualization': ['Power BI','Tableau']
    }
    skills_de_lower = {key: {x.lower() : x for x in skill_list} for key, skill_list in skills.items()}
    df[list(skills)] = pd.DataFrame([
        df['Job_description'].str.findall('|'.join(skills[skill]),re.IGNORECASE)
            .apply(lambda x: [y[0] for y in 
                pd.DataFrame(pd.Series(x).str.lower().unique())
                    .applymap(lambda x: skills_de_lower[skill][x]).to_numpy()] 
            if len(x) else [])
        for skill in skills], index=list(skills)).T
    

    Output:

      Companyname Jobtitle                                    Job_description Programming_Languages   Data_visualization
    0         ABC      ABC  Python,C#,Power BI,Tableau,python,Power BI,Pow...          [Python, C#]  [Power BI, Tableau]
    1         ABC      ABC                                    Power BI,Python              [Python]           [Power BI]
    2         ABC      ABC            Rstudio,Rstudio,Rstudio,Tableau,Tableau             [Rstudio]            [Tableau]
    3         ABC      ABC                                     C#,C#,C#,HTML5           [C#, HTML5]                   []
    

    Explanation:


    If you want csv values instead of arrays, you can add .str.join(',') just before for skill in skills like this:

    df[list(skills)] = pd.DataFrame([
        df['Job_description'].str.findall('|'.join(skills[skill]),re.IGNORECASE)
            .apply(lambda x: [y[0] for y in 
                pd.DataFrame(pd.Series(x).str.lower().unique())
                    .applymap(lambda x: skills_de_lower[skill][x]).to_numpy()] 
            if len(x) else []).str.join(',')
        for skill in skills], index=list(skills)).T
    

    Output:

      Companyname Jobtitle                                    Job_description Programming_Languages Data_visualization
    0         ABC      ABC  Python,C#,Power BI,Tableau,python,Power BI,Pow...             Python,C#   Power BI,Tableau
    1         ABC      ABC                                    Power BI,Python                Python           Power BI
    2         ABC      ABC            Rstudio,Rstudio,Rstudio,Tableau,Tableau               Rstudio            Tableau
    3         ABC      ABC                                     C#,C#,C#,HTML5              C#,HTML5
    

    Explanation: