I have a vocabulary list that includes n-grams, as follows.
myvocabulary = ['tim tam', 'jam', 'fresh milk', 'chocolates', 'biscuit pudding']
I want to use these words to calculate TF-IDF values.
I also have a corpus stored as a dictionary, as follows (key = recipe number, value = recipe).
corpus = {1: "making chocolates biscuit pudding easy first get your favourite biscuit chocolates", 2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates", 3: "making chocolates drink different way using fresh milk egg"}
I am currently using the following code.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english')
tfs = tfidf.fit_transform(corpus.values())
Now I am printing the tokens or n-grams of recipe 1 in the corpus along with their TF-IDF values as follows.
feature_names = tfidf.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
doc = 0
feature_index = tfs[doc,:].nonzero()[1]
tfidf_scores = zip(feature_index, [tfs[doc, x] for x in feature_index])
for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
    print(w, s)
The result I get is chocolates 1.0. However, my code does not detect n-grams (bigrams) such as biscuit pudding when calculating TF-IDF values. Please let me know where my code goes wrong.
I want to get the TF-IDF matrix for the myvocabulary terms using the recipe documents in the corpus. In other words, the rows of the matrix represent the myvocabulary terms and the columns represent the recipe documents of my corpus. Please help me.
Try increasing the ngram_range in TfidfVectorizer:
tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english', ngram_range=(1,2))
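The reason this helps is that the default ngram_range is (1, 1), so the analyzer only ever produces single words and a two-word vocabulary entry can never be matched. A minimal sketch of that behaviour, using a shortened sample sentence chosen here for illustration (it is not one of the recipes above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Shortened sample text, chosen only for illustration.
text = "making chocolates biscuit pudding easy"

# Default ngram_range=(1, 1): the analyzer emits single words only.
unigram_tokens = TfidfVectorizer(stop_words='english').build_analyzer()(text)

# ngram_range=(1, 2): single words plus every pair of adjacent words.
bigram_tokens = TfidfVectorizer(
    stop_words='english', ngram_range=(1, 2)
).build_analyzer()(text)

print('biscuit pudding' in unigram_tokens)  # False
print('biscuit pudding' in bigram_tokens)   # True
```

Because the vocabulary is matched against these tokens, 'biscuit pudding' only gets a column with non-zero counts once bigrams are generated.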
Edit: The output of TfidfVectorizer is the TF-IDF matrix in sparse format (or, more precisely, the transpose of the layout you are after: its rows are documents and its columns are vocabulary terms). You can print out its contents like this, for example:
feature_names = tfidf.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
corpus_index = [n for n in corpus]
rows, cols = tfs.nonzero()
for row, col in zip(rows, cols):
    print((feature_names[col], corpus_index[row]), tfs[row, col])
which should yield
('biscuit pudding', 1) 0.646128915046
('chocolates', 1) 0.763228291628
('chocolates', 2) 0.508542320378
('tim tam', 2) 0.861036995944
('chocolates', 3) 0.508542320378
('fresh milk', 3) 0.861036995944
If the matrix is not large, it might be easier to examine it in dense form. Pandas makes this very convenient:
import pandas as pd
df = pd.DataFrame(tfs.T.todense(), index=feature_names, columns=corpus_index)
print(df)
This results in
                        1         2         3
tim tam          0.000000  0.861037  0.000000
jam              0.000000  0.000000  0.000000
fresh milk       0.000000  0.000000  0.861037
chocolates       0.763228  0.508542  0.508542
biscuit pudding  0.646129  0.000000  0.000000
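To see where numbers such as 0.763228 and 0.646129 come from, the column for recipe 1 can be reproduced by hand. The sketch below assumes the TfidfVectorizer defaults (smoothed IDF, i.e. ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each document) and uses term counts read off the corpus above; the helper name smooth_idf is made up for this example.

```python
import math

n_docs = 3  # number of recipes in the corpus

def smooth_idf(doc_freq):
    # Smoothed IDF, assuming the TfidfVectorizer defaults (smooth_idf=True).
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Recipe 1: 'chocolates' occurs twice and appears in all 3 recipes;
# 'biscuit pudding' occurs once and appears in 1 recipe only.
raw = {
    'chocolates': 2 * smooth_idf(3),
    'biscuit pudding': 1 * smooth_idf(1),
}

# L2-normalise the document vector (assumes the default norm='l2').
norm = math.sqrt(sum(v * v for v in raw.values()))
scores = {term: v / norm for term, v in raw.items()}

for term, score in scores.items():
    print(term, round(score, 6))
```

This also explains why the original code printed chocolates 1.0: with only one matched term in the document, L2 normalisation scales that single value to 1.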