python, text, cluster-analysis, k-means, minhash

k-means using signature matrix generated from minhash


I have used MinHash on documents and their shingles to generate a signature matrix for these documents. I have verified that the signature matrices are good: comparing the Jaccard distances of known similar documents (say, two articles about the same sports team or two articles about the same world event) gives sensible readings.
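For reference, my verification step looks roughly like this (a sketch using the datasketch library; `doc_a` and `doc_b` are placeholders for two articles I know to be similar):

    # Sketch of the verification step, assuming the datasketch MinHash implementation.
    # `doc_a` and `doc_b` are placeholders for two known-similar articles.
    from datasketch import MinHash

    def signature(text, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for i in range(len(text) - 2):          # 3-character shingles
            m.update(text[i:i + 3].encode('utf-8'))
        return m

    sim = signature(doc_a).jaccard(signature(doc_b))   # estimated Jaccard similarity
    print(sim)                                         # close to 1.0 for similar articles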

My question is: does it make sense to use this signature matrix to perform k-means clustering?

I've tried using the signature vectors of the documents and calculating the Euclidean distance between them inside the iterative k-means algorithm, and I always get nonsense clusters. I know there should be two clusters (my data set is a few thousand articles about either sports or business), yet in the end my two clusters always look random. I'm convinced that the randomness of hashing words into integers skews the distance function every time and overpowers similar hash values in two signature matrices.

[Edited to highlight the question]


Solution

  • TL;DR

    Short answer: No, it doesn't make sense to use the signature matrix for K-means clustering. At least, not without significant manipulation.

    Some explanations

    I'm coming at this after a few days of figuring out how to do the same thing (text clustering) myself. I might be wrong, but my perception is that you're making the same mistake I was: using MinHash to build an [n_samples x n_perms] matrix, then using this as a feature matrix X on which you run k-means.

    I'm guessing you're doing something like:

    # THIS CODE IS AN EXAMPLE OF WHAT NOT TO DO! DON'T IMPLEMENT IT!
    import numpy as np
    from datasketch import MinHash
    from nltk.util import ngrams
    from sklearn.cluster import KMeans
    # Get your data.
    data = get_your_list_of_strings_to_cluster()
    n_samples = len(data)
    # MinHash all the strings
    n_perms = 128
    minhash_values = np.zeros((n_samples, n_perms), dtype='uint64')
    for index, string in enumerate(data):
        minhash = MinHash(num_perm=n_perms)
        for gram in ngrams(string, 3):
            minhash.update("".join(gram).encode('utf-8'))
        minhash_values[index, :] = minhash.hashvalues
    # Compute clusters
    clusterer = KMeans(n_clusters=8)
    clusters = clusterer.fit_predict(minhash_values)
    

    This will behave horribly because of a fatal flaw: the minhash_values array is not a feature matrix. Each row is essentially a list of features (hashes) that appear in that text sample, but the rows are not column-aligned, so the same feature lands in different columns for different documents.

    To turn that into a feature matrix, you'd have to collect all the unique hashes in minhash_values and then create a matrix of shape [n_samples x n_unique_hashes] (where n_unique_hashes is the number of unique features found), setting an entry to 1 where the text sample contains that feature and 0 elsewhere. Typically this matrix would be large and sparse. You could then cluster on that, as shown in the sketch below.
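    For illustration, here's a minimal sketch of that conversion. It assumes the minhash_values array from the snippet above and is written for clarity, not efficiency:

    # Sketch only: turn the per-document hash lists into a binary feature matrix.
    # Assumes `minhash_values` is the [n_samples x n_perms] array built above.
    import numpy as np
    from scipy.sparse import lil_matrix
    from sklearn.cluster import KMeans

    unique_hashes = np.unique(minhash_values)                  # every hash seen anywhere
    hash_to_col = {h: i for i, h in enumerate(unique_hashes)}  # hash -> column index

    features = lil_matrix((minhash_values.shape[0], unique_hashes.size), dtype=np.int8)
    for row, hashes in enumerate(minhash_values):
        for h in hashes:
            features[row, hash_to_col[h]] = 1                  # 1 = document contains this hash

    clusters = KMeans(n_clusters=2).fit_predict(features.tocsr())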

    Alternative way of text clustering

    What an unbelievable hassle though! Fortunately, scikit-learn is there to help. It provides some very easy-to-use, scalable vectorisers such as CountVectorizer, TfidfVectorizer and HashingVectorizer.

    So your problem becomes easily solved:

    # Imports
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.cluster import KMeans
    
    # Get your data
    data = get_your_list_of_strings_to_cluster()
    
    # Get your feature matrix
    text_features = HashingVectorizer(analyzer="word").fit_transform(data)
    
    # Compute clusters
    clusterer = KMeans(n_clusters=2)
    clusters = clusterer.fit_predict(text_features)
    

    And there you go. From there, one tweak worth experimenting with:
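    If the hashed features don't separate your two topics cleanly, swapping in TfidfVectorizer is a one-line change worth trying (again just a sketch; the parameters are illustrative):

    # Alternative sketch: TF-IDF features often cluster topical text more cleanly.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    text_features = TfidfVectorizer(analyzer="word", stop_words="english").fit_transform(data)
    clusters = KMeans(n_clusters=2).fit_predict(text_features)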

    Hope this helps.

    Tom