I am using sklearn in Python to do some clustering. I've trained on 200,000 documents, and the code below works well.
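from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = open("token_from_xml.txt")
vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
km = KMeans(n_clusters=30)
kmresult = km.fit(tfidf).predict(tfidf)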
But when I have new test content, I'd like to assign it to the existing clusters I've already trained. So I'm wondering how to save the IDF result, so that I can compute TF-IDF for the new test content and make sure the resulting vector has the same length as the training vectors.
Thanks in advance.
UPDATE
I may need to save the "transformer" or "tfidf" variable to a file (txt or other), if one of them contains the trained IDF result.
UPDATE
For example, I have the training data:
["a", "b", "c"]
["a", "b", "d"]
After TF-IDF, the result will contain 4 features (a, b, c, d).
When I TEST:
["a", "c", "d"]
to see which cluster (already built by k-means) it belongs to, TF-IDF will only give a result with 3 features (a, c, d), so the k-means clustering will fail. (If I test ["a", "b", "e"], there may be other problems.)
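The mismatch described above can be reproduced with a small sketch. It assumes three-letter tokens ("aaa", "bbb", ...) instead of single letters, because CountVectorizer's default token pattern drops one-character tokens; the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["aaa bbb ccc", "aaa bbb ddd"]
test_docs = ["aaa ccc ddd"]

# Fit once on the training documents: 4 distinct tokens, so 4 columns.
fitted = CountVectorizer()
train_counts = fitted.fit_transform(train_docs)

# Fitting a fresh vectorizer on the test document learns only its 3 tokens.
refit_counts = CountVectorizer().fit_transform(test_docs)

# Reusing the fitted vectorizer keeps the 4 training columns.
reused_counts = fitted.transform(test_docs)

# A token never seen in training ("eee") is ignored by transform().
unseen_counts = fitted.transform(["aaa bbb eee"])

print(train_counts.shape, refit_counts.shape, reused_counts.shape, unseen_counts.shape)
```

So the column count only changes when a vectorizer is fitted again; calling transform() on the already fitted one keeps the training feature space.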
So how can I store the feature list for the test data (and, better still, store it in a file)?
I successfully saved the feature list by saving vectorizer.vocabulary_, and reused it via CountVectorizer(decode_error="replace", vocabulary=vectorizer.vocabulary_).
Code below:
import pickle
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = np.array(["aaa bbb ccc", "aaa bbb ddd"])
vectorizer = CountVectorizer(decode_error="replace")
vec_train = vectorizer.fit_transform(corpus)
# Save vectorizer.vocabulary_
with open("feature.pkl", "wb") as f:
    pickle.dump(vectorizer.vocabulary_, f)
# Load it later
transformer = TfidfTransformer()
with open("feature.pkl", "rb") as f:
    loaded_vec = CountVectorizer(decode_error="replace", vocabulary=pickle.load(f))
tfidf = transformer.fit_transform(loaded_vec.fit_transform(np.array(["aaa ccc eee"])))
That works. tfidf will have the same number of features as the training data.
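One caveat: in the snippet above, TfidfTransformer is fitted again on the test document, so the vocabulary is reused but the IDF weights are learned from the test data rather than from training. A possible alternative, sketched below, is to pickle the fitted objects themselves and call only transform() and predict() later. The toy corpus, the file name, and the use of TfidfVectorizer (which combines CountVectorizer and TfidfTransformer) are assumptions for illustration, not part of the original code.

```python
import os
import pickle
import tempfile

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training corpus (assumed for illustration).
train_docs = ["aaa bbb ccc", "aaa bbb ddd", "eee fff ggg", "eee fff hhh"]

# Fit vocabulary, IDF weights and clusters once, on the training data.
vectorizer = TfidfVectorizer(decode_error="replace")
train_tfidf = vectorizer.fit_transform(train_docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(train_tfidf)

# Hypothetical file name; both fitted objects go into one pickle.
model_path = os.path.join(tempfile.mkdtemp(), "tfidf_kmeans.pkl")
with open(model_path, "wb") as f:
    pickle.dump((vectorizer, km), f)

# Later: load the fitted objects and only transform/predict, never fit again.
with open(model_path, "rb") as f:
    loaded_vectorizer, loaded_km = pickle.load(f)

new_tfidf = loaded_vectorizer.transform(["aaa ccc zzz"])  # "zzz" is unseen and ignored
new_label = loaded_km.predict(new_tfidf)
print(new_tfidf.shape, new_label)
```

Pickled estimators are generally only safe to load with the same scikit-learn version that wrote them, and only from a trusted source.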