I need to extract the skills mentioned in a job description and store them as a new column. The dataframe X looks like the following:
Job_ID  Job_Desc
1       Applicant should possess technical capabilities including proficient knowledge of python and SQL
2       Applicant should possess technical capabilities including proficient knowledge of python and SQL and R
The resulting output should look like the following:
Job_ID Skills
1 Python,SQL
2 Python,SQL,R
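For reference, the input above can be reproduced like this:

import pandas as pd

# Input dataframe as described above
X = pd.DataFrame({
    "Job_ID": [1, 2],
    "Job_Desc": [
        "Applicant should possess technical capabilities including proficient knowledge of python and SQL",
        "Applicant should possess technical capabilities including proficient knowledge of python and SQL and R",
    ],
})

# Desired output:
# Job_ID  Skills
# 1       Python,SQL
# 2       Python,SQL,R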
I have used a CountVectorizer with a TF-IDF transformer to get the most important words within the Job_Desc column, but I am still not able to get the desired skills in the output. Could this be achieved with Word2Vec, using the skip-gram or CBOW model?
My code looks like this:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# vectorize the job-description column (passing the whole dataframe would
# only fit on the column names)
corpus = X["Job_Desc"]

cv = CountVectorizer(max_df=0.50)
word_count_vector = cv.fit_transform(corpus)

tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)

def sort_coo(coo_matrix):
    """Sort the non-zero entries of a sparse row by tf-idf score, descending."""
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """Get the feature names and tf-idf scores of the top n items."""
    # use only the top n items from the vector
    sorted_items = sorted_items[:topn]

    score_vals = []
    feature_vals = []
    for idx, score in sorted_items:
        # keep track of the feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])

    # build a dict of feature -> score
    results = {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]] = score_vals[idx]
    return results

feature_names = cv.get_feature_names()

# score the first job description
doc = corpus[0]
tf_idf_vector = tfidf_transformer.transform(cv.transform([doc]))
sorted_items = sort_coo(tf_idf_vector.tocoo())
keywords = extract_topn_from_vector(feature_names, sorted_items, 10)

print("\n=====Title=====")
print(doc)
print("\n===Keywords===")
for k in keywords:
    print(k, keywords[k])
I can't think of a way that TF-IDF, Word2Vec, or other simple/unsupervised algorithms could, alone, identify the kinds of 'skills' you need.
You'll likely need a large hand-curated list of skills – at the very least, as a way to automate the evaluation of methods that purport to extract skills.
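For illustration only, here is a minimal sketch of the curated-list approach, assuming a hypothetical hand-made SKILLS list and the dataframe X from your question; the list itself is the part that needs real curation:

import re
import pandas as pd

# Hypothetical hand-curated skill list (the hard part in practice)
SKILLS = ["Python", "SQL", "R", "Java", "Excel"]

def extract_skills(text, skills=SKILLS):
    """Return the curated skills that appear as whole words in the text."""
    found = []
    for skill in skills:
        if re.search(r"\b" + re.escape(skill) + r"\b", text, flags=re.IGNORECASE):
            found.append(skill)
    return ",".join(found)

result = pd.DataFrame({
    "Job_ID": X["Job_ID"],
    "Skills": X["Job_Desc"].apply(extract_skills),
})
print(result)

On your two example rows this reproduces the Skills column you want, but only because the list already contains the right entries – the matching is trivial; the curation is the work.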
With a curated list, something like Word2Vec might help suggest synonyms, alternate forms, or related skills. (For a known skill X, and a large Word2Vec model trained on your text, terms similar to X are likely to be similar skills – but not guaranteed, so you'd likely still need human review/curation.)
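A sketch of that idea with gensim, assuming the job-description texts have been tokenized into sentences (a toy corpus this small won't produce meaningful neighbors):

from gensim.models import Word2Vec

# token lists from the job-description corpus (assumed to exist at real scale)
sentences = [desc.lower().split() for desc in X["Job_Desc"]]

# sg=1 -> skip-gram, sg=0 -> CBOW
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# For a known skill, look at its nearest neighbors as candidate related skills,
# then have a human review them before adding anything to the curated list.
print(model.wv.most_similar("python", topn=5))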
With a large enough dataset mapping texts to outcomes – say, a candidate-description text (resume) mapped to whether a human reviewer chose them for an interview, hired them, or they succeeded in the job – you might be able to identify terms that are highly predictive of fit in a certain job role. Those terms might often be de facto 'skills'. But discovering those correlations could be a much larger learning project.
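If you ever have such a labelled dataset, one hedged sketch of that direction: fit a linear model on tf-idf features against the outcome labels (texts and labels below are assumptions, not data you have yet), then inspect the highest-weight terms as candidates for human review:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# texts: list of candidate/job texts; labels: 1 = positive outcome (e.g. hired), 0 = not.
# Both are assumptions here – you would need a large labelled dataset.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(texts)

clf = LogisticRegression(max_iter=1000)
clf.fit(features, labels)

# Terms with the largest positive weights are the most predictive of the positive
# outcome; some may be de facto skills, but they still need human curation.
terms = np.array(vectorizer.get_feature_names_out())
top = np.argsort(clf.coef_[0])[::-1][:20]
print(terms[top])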