python, similarity, hamming-distance, distance-matrix

Distance metric for comparing ingredient lists


I am using sklearn's pairwise_distances to measure how similar different products are based on their ingredients. My initial df contains only 0s and 1s and looks like this:

Products      Ingredient 1   Ingredient 2   ...   Ingredient 500
Product 1     0              1              ...   1
Product 2     1              1              ...   0
...           ...            ...            ...   ...
Product 600   1              1              ...   1

To get the distance for each pair of products based on their ingredients, I computed a distance matrix with the following code:

from sklearn.metrics import pairwise_distances

X = df.to_numpy()
distance_array = pairwise_distances(X, metric='hamming')

I selected hamming as the metric based on this article, https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa, because I would like to know the absolute number of ingredients that differ between each product pair. However, the matrix returns floats like 0.006 for a product pair that differs by only one ingredient, whereas I expected it to return 1 in this case.

Can anyone explain why the hamming distance is not returning absolute numbers? Is there a more suitable metric for my use case?

Thanks a lot!!


Solution

  • It states "number of values that are different between two vectors", so yes, you would expect to see 1 if only one ingredient differs. However, the hamming metric here returns the proportion of positions that differ, not the count. So if 2 of the 3 values differ, that's 0.6667.

    If you see 0, that means no difference. If you see 1, it means 100% difference (i.e. all columns differ when compared).

    If you want the number of differences, you'll need to multiply the values by the number of ingredients.

    import pandas as pd
    from sklearn.metrics.pairwise import pairwise_distances
    
    
    data = [
            [0,1,1],
            [1,1,0],
            [1,1,1],
            [0,1,1],
            [0,0,1]]
    
    columns = ['Ing1', 'Ing2','Ing3']
    df = pd.DataFrame(data=data, columns=columns)
    
    
    X = df.to_numpy()
    # Hamming distance here is the fraction of ingredient columns that differ
    distance_array = pairwise_distances(X, metric='hamming')
    
    products = ['Product %s' % i for i in range(1, len(df) + 1)]
    
    # Label both rows and columns with the product names
    distance_matrix = pd.DataFrame(distance_array)
    distance_matrix.index = products
    distance_matrix.columns = products
    
    distance_matrix_vals = distance_matrix * len(columns)
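
As a sanity check on the scaling step, here is a minimal sketch on the same toy data. It compares the rescaled hamming result with a direct count of mismatches, and with the 'manhattan' metric, which for strictly 0/1 data should already give the number of differing ingredients. The variable names are only illustrative, and the Manhattan shortcut assumes the data really contains nothing but 0s and 1s.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Same toy data as above: 5 products, 3 binary ingredient flags
X = np.array([
    [0, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
])
n_ingredients = X.shape[1]

# Hamming gives the fraction of differing positions; rescale to counts
hamming = pairwise_distances(X, metric="hamming")
hamming_counts = np.rint(hamming * n_ingredients).astype(int)

# For 0/1 data the Manhattan (L1) distance is the mismatch count itself
manhattan = pairwise_distances(X, metric="manhattan")
manhattan_counts = np.rint(manhattan).astype(int)

# Independent check: count mismatches for every pair via broadcasting
direct_counts = (X[:, None, :] != X[None, :, :]).sum(axis=2)

print(direct_counts)
print(np.array_equal(hamming_counts, direct_counts))
print(np.array_equal(manhattan_counts, direct_counts))
```

If all three agree on your real data, using metric='manhattan' directly may be the simpler option, since it skips the multiplication and any rounding concerns.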
    
