My question is similar to this question raised However, The solution provided appears to work where only one element of the list will exist in the dataframe column.
My issue is that multiple values from the list might exist in each row of the df and the list values might be referenced more than once.
I have a dataframe which includes a number of job listings. The Job description column contains the full job description. To extract the key skill requirements from the job description column a number of additional columns were created. eg. Programming_Languages,Data visualisation etc
Each column has a list of values to check for there occurrence in the job description eg.
pl_list=['Python','RStudio','C#','HTML5']
dv_list=['Power BI,Tableau']
My current solution:
df2['Programming_Languages'] = df2['Programming_Languages'].str.findall('|'.join(pl_list),re.IGNORECASE)
df2['Data_Visualisation'] = df2['Data_Visualisation'].str.findall('|'.join(dv_list),re.IGNORECASE)
The job description might reference the values in a list multiple times. so if python and Rstudio are referenced more than once it will return each occurrence. I would just like it to return the value once.
Code to recreate df
df=[['ABC','ABC','Python, C#, Power BI Tableau python Power BI'],['ABC','ABC','Power BI Python'],['ABC','ABC','Rstudio Rstudio Rstudio Tableau'],['ABC','ABC','C# C# C# HTML5']]
df = pd.DataFrame(df, columns=['Companyname', 'Jobtitle','Job_description'])
df2 = df.reindex(df.columns.tolist() + ['Programming_Languages','Data_Visualisation'],axis=True)
My current result:
My desired result:
Here's a way to do it if you don't mind the case of the elements in the new columns:
skills = {
'Programming_Languages': ['Python','C#','Rstudio','HTML5'],
'Data_visualization': ['Power BI','Tableau']
}
df[list(skills)] = pd.DataFrame([
df['Job_description'].str.findall('|'.join(skills[skill]),re.IGNORECASE)
.apply(lambda x: pd.Series(x).str.lower().unique() if len(x) else [])
for skill in skills], index=list(skills)).T
Output:
Companyname Jobtitle Job_description Programming_Languages Data_visualization
0 ABC ABC Python,C#,Power BI,Tableau,python,Power BI,Pow... [python, c#] [power bi, tableau]
1 ABC ABC Power BI,Python [python] [power bi]
2 ABC ABC Rstudio,Rstudio,Rstudio,Tableau,Tableau [rstudio] [tableau]
3 ABC ABC C#,C#,C#,HTML5 [c#, html5] []
Explanation:
Series.apply()
with a lambda that, for each Series value (namely, for each list in the Series), creates a Series from the list and uses .str.lower().unique()
to de-dupe the list after first making all its items lowercase.str
) and just supply an empty listfor skill in skills
to do the above for each item in the skills
dict (which contains an entry for Programming_Languages
and another for Data_visualization
, and could be extended to include additional skills if ever needed)DataFrame()
with index
argument set to the keys in skills
, transpose it, and assign to new columns in the input dataframe named after the keys in skills
.If you want the results to contain the search terms with their original case, you can do this:
skills = {
'Programming_Languages': ['Python','C#','Rstudio','HTML5'],
'Data_visualization': ['Power BI','Tableau']
}
skills_de_lower = {key: {x.lower() : x for x in skill_list} for key, skill_list in skills.items()}
df[list(skills)] = pd.DataFrame([
df['Job_description'].str.findall('|'.join(skills[skill]),re.IGNORECASE)
.apply(lambda x: [y[0] for y in
pd.DataFrame(pd.Series(x).str.lower().unique())
.applymap(lambda x: skills_de_lower[skill][x]).to_numpy()]
if len(x) else [])
for skill in skills], index=list(skills)).T
Output:
Companyname Jobtitle Job_description Programming_Languages Data_visualization
0 ABC ABC Python,C#,Power BI,Tableau,python,Power BI,Pow... [Python, C#] [Power BI, Tableau]
1 ABC ABC Power BI,Python [Python] [Power BI]
2 ABC ABC Rstudio,Rstudio,Rstudio,Tableau,Tableau [Rstudio] [Tableau]
3 ABC ABC C#,C#,C#,HTML5 [C#, HTML5] []
Explanation:
skills_de_lower
to map from the lowercase skill values used to de-dupe, back to the original skill values we searched forapplymap()
and y[0] for y in ...
within the list comprehension to do the mapping work and some list-of-list unpacking to get results in the same form as before (a list of strings).If you want csv values instead of arrays, you can add .str.join(',')
just before for skill in skills
like this:
df[list(skills)] = pd.DataFrame([
df['Job_description'].str.findall('|'.join(skills[skill]),re.IGNORECASE)
.apply(lambda x: [y[0] for y in
pd.DataFrame(pd.Series(x).str.lower().unique())
.applymap(lambda x: skills_de_lower[skill][x]).to_numpy()]
if len(x) else []).str.join(',')
for skill in skills], index=list(skills)).T
Output:
Companyname Jobtitle Job_description Programming_Languages Data_visualization
0 ABC ABC Python,C#,Power BI,Tableau,python,Power BI,Pow... Python,C# Power BI,Tableau
1 ABC ABC Power BI,Python Python Power BI
2 ABC ABC Rstudio,Rstudio,Rstudio,Tableau,Tableau Rstudio Tableau
3 ABC ABC C#,C#,C#,HTML5 C#,HTML5
Explanation:
.str.join(',')
, the .str
part is actually acting on the list in each value (here str
does not mean what its name implies, but something closer to "sequence", which in our case means the list).str.join()
for an example of how this works only on lists of strings.