I'm trying to rewrite some code I wrote in Python, but now in Spark.
#pandas
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
df_final = np.array(tfidf.fit_transform(df['sentence']).todense())
From what I read in the Spark documentation, is it necessary to use Tokenizer, HashingTF and then IDF to model tf-idf in PySpark?
#pyspark
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(df)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
tf = hashingTF.transform(wordsData)
idf = IDF(inputCol="rawFeatures", outputCol="features")
tf_idf = idf.fit(tf)
df_final = tf_idf.transform(tf)
I'm not sure you understand clearly how the tf-idf model works, because tokenizing is essential and fundamental for a tf-idf model no matter whether you use the sklearn or the spark.ml version. Your post actually covers two questions:
1. Why tf-idf needs the sentences to be tokenized: I won't copy the mathematical equation here since it's easy to look up. Long story short, tf-idf is a statistical measure of how relevant a word is to a document in a collection of documents. It is calculated from how frequently the word appears in a document (tf) and the inverse of how frequently the word appears across the set of documents (idf). Since the essence is the vocabulary and all of the calculation is based on that vocabulary, if your input is a raw sentence, as in your sklearn version, you must tokenize it before the calculation, otherwise the whole methodology is no longer valid.
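To make that concrete, here is a minimal toy sketch in plain Python (not how any library implements it, and using an unsmoothed variant of the formula). Both counts only make sense once the sentences have been split into tokens:

#toy example
import math
from collections import Counter

docs = [["spark", "is", "fast"], ["spark", "handles", "big", "data"]]

n_docs = len(docs)
tf = [Counter(tokens) for tokens in docs]                      # term frequency per document
df_ = Counter(term for tokens in docs for term in set(tokens)) # document frequency per term

# tf-idf of "spark" in the first document
score = tf[0]["spark"] * math.log(n_docs / df_["spark"])
print(score)  # 0.0 -> "spark" appears in every document, so it carries no weight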
2. How tf-idf works in sklearn: If you understand how tf-idf works, then you should understand that the different steps in the example from the Spark official documentation are all essential. Thanks to the sklearn developers for creating such a convenient API, you can call .fit_transform() directly on a Series of sentences. In fact, if you check the source code of TfidfVectorizer in sklearn, you can see that it actually does the "tokenization" too, just in a different way:

TfidfVectorizer inherits from CountVectorizer (https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/feature_extraction/text.py#L1717).
It uses the ._count_vocab() method of CountVectorizer to transform your sentences (https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/feature_extraction/text.py#L1338).
Inside ._count_vocab(), it goes through each sentence and builds the sparse matrix that stores the frequency of each vocabulary term in each sentence before the tf-idf calculation (https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/feature_extraction/text.py#L1192).
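If you want to see this without digging through the source, the vectorizer exposes the analyzer it builds internally; this is just an illustrative snippet, not part of your original code:

#inspect sklearn's internal tokenization
from sklearn.feature_extraction.text import TfidfVectorizer

analyzer = TfidfVectorizer().build_analyzer()  # the callable used to split raw text into tokens
print(analyzer("Spark handles big data"))      # ['spark', 'handles', 'big', 'data']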
To conclude, tokenizing the sentences is essential for the tf-idf model calculation, and the example in the Spark official documentation is efficient enough for your model building. Remember to use the function or method Spark provides when such an API exists, and DON'T try to build a user-defined function/class to achieve the same goal, otherwise it may reduce your computing performance or trigger other issues such as running out of memory.
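If you want the three Spark stages as a single reusable object, you can wire them into Spark's own Pipeline. The sketch below is an assumption on my side: it swaps HashingTF for Spark's CountVectorizer so you get an exact vocabulary like sklearn instead of hashed buckets; keep HashingTF if hashing is fine for your use case.

#pyspark, same steps as a Pipeline (sketch)
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer, IDF

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
cv = CountVectorizer(inputCol="words", outputCol="rawFeatures")
idf = IDF(inputCol="rawFeatures", outputCol="features")

pipeline = Pipeline(stages=[tokenizer, cv, idf])  # all stages are built-in Spark estimators/transformers
model = pipeline.fit(df)                          # df is your DataFrame with a "sentence" column
df_final = model.transform(df)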