I need to extract the skills mentioned in a job description and store them as a new column. The dataframe X looks like the following:
Job_ID  Job_Desc
1       Applicant should possess technical capabilities including proficient knowledge of python and SQL
2       Applicant should possess technical capabilities including proficient knowledge of python and SQL and R
The resulting output should look like the following:
Job_ID Skills
1 Python,SQL
2 Python,SQL,R
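For reference, the input above can be reproduced like this:

import pandas as pd

# Input dataframe as described above
X = pd.DataFrame({
    "Job_ID": [1, 2],
    "Job_Desc": [
        "Applicant should possess technical capabilities including proficient knowledge of python and SQL",
        "Applicant should possess technical capabilities including proficient knowledge of python and SQL and R",
    ],
})

# Desired output:
# Job_ID  Skills
# 1       Python,SQL
# 2       Python,SQL,R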
I have used a CountVectorizer with a TF-IDF transformer to get the most important words within the Job_Desc column, but I am still not able to get the desired skills in the output. Could this be achieved with Word2Vec, using the skip-gram or CBOW model?
My code looks like this:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# vectorize the job-description column (passing the whole dataframe would
# only fit on the column names)
corpus = X["Job_Desc"]

cv = CountVectorizer(max_df=0.50)
word_count_vector = cv.fit_transform(corpus)

tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)

def sort_coo(coo_matrix):
    """Sort the non-zero entries of a sparse row by tf-idf score, descending."""
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """Get the feature names and tf-idf scores of the top n items."""
    # use only the top n items from the vector
    sorted_items = sorted_items[:topn]

    score_vals = []
    feature_vals = []
    for idx, score in sorted_items:
        # keep track of the feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])

    # build a dict of feature -> score
    results = {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]] = score_vals[idx]
    return results

feature_names = cv.get_feature_names()

# score the first job description
doc = corpus[0]
tf_idf_vector = tfidf_transformer.transform(cv.transform([doc]))
sorted_items = sort_coo(tf_idf_vector.tocoo())
keywords = extract_topn_from_vector(feature_names, sorted_items, 10)

print("\n=====Title=====")
print(doc)
print("\n===Keywords===")
for k in keywords:
    print(k, keywords[k])
I can't think of a way that TF-IDF, Word2Vec, or other simple/unsupervised algorithms could, alone, identify the kinds of 'skills' you need.
You'll likely need a large hand-curated list of skills – at the very least, as a way to automate the evaluation of methods that purport to extract skills.
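For illustration only, here is a minimal sketch of the curated-list approach, assuming a hypothetical hand-made SKILLS list and the dataframe X from your question; the list itself is the part that needs real curation:

import re
import pandas as pd

# Hypothetical hand-curated skill list (the hard part in practice)
SKILLS = ["Python", "SQL", "R", "Java", "Excel"]

def extract_skills(text, skills=SKILLS):
    """Return the curated skills that appear as whole words in the text."""
    found = []
    for skill in skills:
        if re.search(r"\b" + re.escape(skill) + r"\b", text, flags=re.IGNORECASE):
            found.append(skill)
    return ",".join(found)

result = pd.DataFrame({
    "Job_ID": X["Job_ID"],
    "Skills": X["Job_Desc"].apply(extract_skills),
})
print(result)

On your two example rows this reproduces the Skills column you want, but only because the list already contains the right entries – the matching is trivial; the curation is the work.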
With a curated list, something like Word2Vec might help suggest synonyms, alternate forms, or related skills. (For a known skill X, and a large Word2Vec model trained on your text, terms similar to X are likely to be similar skills – but not guaranteed, so you'd likely still need human review/curation.)
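A sketch of that idea with gensim, assuming the job-description texts have been tokenized into sentences (a toy corpus this small won't produce meaningful neighbors):

from gensim.models import Word2Vec

# token lists from the job-description corpus (assumed to exist at real scale)
sentences = [desc.lower().split() for desc in X["Job_Desc"]]

# sg=1 -> skip-gram, sg=0 -> CBOW
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# For a known skill, look at its nearest neighbors as candidate related skills,
# then have a human review them before adding anything to the curated list.
print(model.wv.most_similar("python", topn=5))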
With a large enough dataset mapping texts to outcomes – say, a candidate-description text (resume) mapped to whether a human reviewer chose them for an interview, hired them, or they succeeded in the job – you might be able to identify terms that are highly predictive of fit in a certain job role. Those terms might often be de facto 'skills'. But discovering those correlations could be a much larger learning project.
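If you ever have such a labelled dataset, one hedged sketch of that direction: fit a linear model on tf-idf features against the outcome labels (texts and labels below are assumptions, not data you have yet), then inspect the highest-weight terms as candidates for human review:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# texts: list of candidate/job texts; labels: 1 = positive outcome (e.g. hired), 0 = not.
# Both are assumptions here – you would need a large labelled dataset.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(texts)

clf = LogisticRegression(max_iter=1000)
clf.fit(features, labels)

# Terms with the largest positive weights are the most predictive of the positive
# outcome; some may be de facto skills, but they still need human curation.
terms = np.array(vectorizer.get_feature_names_out())
top = np.argsort(clf.coef_[0])[::-1][:20]
print(terms[top])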