Tags: similarity, recommendation-engine, mahout-recommender

Item similarity based on item features


I have a dataset with items but with no user ratings.

Each item has about 400 features.

I want to measure the similarity between items based on features (Row similarity).

I converted the item-feature data into a binary matrix like the following:

    itemID | feature1 | feature2 | feature3 | feature4 ....
    1      | 0        | 1        | 1        | 0
    2      | 1        | 0        | 0        | 1
    3      | 1        | 1        | 1        | 0
    4      | 0        | 0        | 1        | 1

I don't know what to use (and how to use it) to measure the row similarity.

I want, for Item X, to get the top k similar items.

Sample code would be much appreciated.


Solution

  • What you are looking for is called a similarity measure. A quick Google or Stack Overflow search turns up various ways to measure the similarity between two vectors. Here is some sample Python 3 code for cosine similarity:

    from math import sqrt

    def square_rooted(x):
        # Euclidean (L2) norm of the vector; left unrounded to avoid
        # compounding rounding error in the final result
        return sqrt(sum(a * a for a in x))

    def cosine_similarity(x, y):
        # dot(x, y) / (|x| * |y|); raises ZeroDivisionError for an all-zero vector
        numerator = sum(a * b for a, b in zip(x, y))
        denominator = square_rooted(x) * square_rooted(y)
        return round(numerator / denominator, 3)

    print(cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15]))
    

    taken from: http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/
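Because your features are binary, a set-overlap measure such as Jaccard similarity is another option worth trying alongside cosine. The following is only a sketch and is not part of the code quoted above; jaccard_similarity is a name chosen for this illustration, and returning 0.0 for two all-zero rows is an assumption rather than a standard convention.

```python
def jaccard_similarity(x, y):
    """Jaccard similarity of two equal-length 0/1 vectors."""
    # Number of features set in both rows
    both = sum(1 for a, b in zip(x, y) if a and b)
    # Number of features set in at least one of the rows
    either = sum(1 for a, b in zip(x, y) if a or b)
    # Assumption: two rows with no features at all are treated as dissimilar
    return both / either if either else 0.0


# Rows 1 and 3 of the example matrix from the question
item1 = [0, 1, 1, 0]
item3 = [1, 1, 1, 0]
print(jaccard_similarity(item1, item3))
```

Items 1 and 3 share two features out of three that appear in either row, so the score is 2/3. Unlike cosine similarity, this ignores vector length entirely and only looks at which features overlap.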

    Since you want the top k similar items for a given item, a k-nearest-neighbour (kNN) search is a natural fit. You can build a kNN index over the item vectors once and then query it for the k items closest to any item.
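For a modest number of items, an exact brute-force search may be enough before reaching for an index: score the query item against every other item and keep the k best. Here is a minimal sketch in plain Python using the four example rows from the question. The names top_k_similar and items are made up for this illustration, and treating an all-zero row as similarity 0.0 is an assumption.

```python
import heapq
from math import sqrt


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length vectors (unrounded)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    # Assumption: an all-zero row is treated as similar to nothing
    return dot / norm if norm else 0.0


def top_k_similar(items, query_id, k):
    """Return up to k (item_id, similarity) pairs, most similar first."""
    query = items[query_id]
    scored = (
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in items.items()
        if other_id != query_id  # do not return the query item itself
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# itemID -> binary feature vector, copied from the question's example
items = {
    1: [0, 1, 1, 0],
    2: [1, 0, 0, 1],
    3: [1, 1, 1, 0],
    4: [0, 0, 1, 1],
}

for item_id, score in top_k_similar(items, query_id=1, k=2):
    print(item_id, round(score, 3))
```

For item 1 this ranks item 3 first (similarity about 0.816) and item 4 second (0.5). Each query scans every item, so computing neighbours for all n items costs on the order of n² comparisons; that is where an approximate index starts to pay off.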

    A good library for this is nmslib. Here is some sample code for a kNN query using its HNSW method with cosine similarity. HNSW is an approximate method and only one of several the library offers, but it tends to be efficient on high-dimensional data such as yours:

    import nmslib
    import numpy
    
    # create a random matrix to index
    data = numpy.random.randn(10000, 100).astype(numpy.float32)
    
    # initialize a new index, using a HNSW index on Cosine Similarity
    index = nmslib.init(method='hnsw', space='cosinesimil')
    index.addDataPointBatch(data)
    index.createIndex({'post': 2}, print_progress=True)
    
    # query for the nearest neighbours of the first datapoint
    ids, distances = index.knnQuery(data[0], k=10)
    
    # get the nearest neighbours of all the datapoints,
    # using a pool of 4 threads to compute
    neighbours = index.knnQueryBatch(data, k=10, num_threads=4)
    

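    After this runs, the neighbours variable holds one (ids, distances) pair per data point, listing that point's k nearest neighbours. To use it on your data, replace the random matrix with your item-feature matrix converted to float32. Note that a point that is already in the index will normally come back as its own nearest neighbour.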
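A query result usually needs a little post-processing before it answers "top k items similar to item X": drop the item itself and map row indices back to item IDs. The sketch below assumes the result is a pair of parallel sequences (ids, distances) sorted nearest first, and that the cosinesimil space reports cosine distance (1 minus similarity), which is my understanding of nmslib. The function name, the item_ids list, and the sample numbers are all made up for illustration.

```python
def similar_items(ids, distances, query_index, item_ids, k):
    """Turn one (ids, distances) result into up to k (item_id, similarity) pairs."""
    results = []
    for idx, dist in zip(ids, distances):
        if idx == query_index:
            # Skip the query item, which is normally its own nearest neighbour
            continue
        # Assumption: distance is 1 - cosine similarity
        results.append((item_ids[idx], 1.0 - dist))
        if len(results) == k:
            break
    return results


# Made-up values for illustration only; real ones would come from
# something like index.knnQuery(vector, k=k + 1)
item_ids = [101, 102, 103, 104]   # hypothetical itemID for each matrix row
ids = [0, 2, 3]                   # row indices, nearest first
distances = [0.0, 0.184, 0.5]     # matching distances

print(similar_items(ids, distances, query_index=0, item_ids=item_ids, k=2))
```

Asking the index for k + 1 neighbours leaves k results once the query item has been removed.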