Tags: python, scikit-learn, k-means, svd, lsa

Why use LSA before K-Means when doing text clustering?


I'm following this tutorial from scikit-learn on text clustering using K-Means: http://scikit-learn.org/stable/auto_examples/text/document_clustering.html

In the example, LSA (implemented via truncated SVD) is optionally used to perform dimensionality reduction.

Why is this useful? The number of dimensions (features) can already be controlled in the TF-IDF vectorizer via the max_features parameter.

I understand that LSA (like LDA) is also a topic-modelling technique. The difference from clustering is that a document can belong to multiple topics, but to only one cluster. What I don't understand is why LSA would be applied as a preprocessing step for K-Means clustering.

Example code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

documents = ["some text", "some other text", "more text"]  # placeholder corpus

tfidf_vectorizer = TfidfVectorizer(max_df=0.5, max_features=10000, min_df=2, stop_words='english', use_idf=True)
X = tfidf_vectorizer.fit_transform(documents)

svd = TruncatedSVD(n_components=1000)  # n_components must be < the number of TF-IDF features
normalizer = Normalizer(copy=False)
# After the SVD projection the rows are no longer unit length; re-normalizing
# makes K-Means' Euclidean distances behave like cosine similarities.
lsa = make_pipeline(svd, normalizer)
Xnew = lsa.fit_transform(X)

model = KMeans(n_clusters=10, init='k-means++', max_iter=100, n_init=1, verbose=False)
model.fit(Xnew)

Solution

  • There is a paper showing that the principal components (PCA eigenvectors) make good initial centroids for K-Means; a sketch of that idea follows below.

    Controlling the dimensionality with the max_features parameter is equivalent to truncating the vocabulary, which has negative effects. For example, if you set max_features to 10, the model will work with the 10 most common words in the corpus and ignore everything else; the second sketch below demonstrates this.
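
A minimal sketch of that initialization idea, assuming a toy corpus (the documents, n_clusters value, and variable names here are made up for illustration). KMeans accepts an explicit (n_clusters, n_features) array for init, and TruncatedSVD plays the role of PCA on a sparse TF-IDF matrix:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = ["some text", "some other text", "more text", "yet more text"]

# TF-IDF matrix: 4 documents x a small vocabulary
X = TfidfVectorizer().fit_transform(docs)

n_clusters = 2
# TruncatedSVD on TF-IDF is LSA; its components_ are the top singular
# vectors, the analogue of PCA eigenvectors for a sparse matrix.
svd = TruncatedSVD(n_components=n_clusters).fit(X)

# Use those vectors as the starting centroids instead of k-means++.
model = KMeans(n_clusters=n_clusters, init=svd.components_, n_init=1)
model.fit(X)
print(model.labels_)

Since the starting point is deterministic, n_init=1 is the natural choice here.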
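
And a small sketch of the vocabulary cutoff (corpus again made up): max_features keeps only the terms with the highest corpus frequency and silently drops everything else, whereas TruncatedSVD keeps the full vocabulary and compresses it into latent dimensions:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Keep only the 3 most frequent terms; every other word is dropped
# from the representation entirely.
vec = TfidfVectorizer(max_features=3).fit(docs)
print(vec.get_feature_names_out())  # ['on' 'sat' 'the']

All the content words ("cat", "dog", "pets", ...) vanish before clustering even starts, which is exactly the negative effect described above.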