python, data-mining, text-mining, similarity

Calculate similarity between lists of words


I want to calculate the similarity between two lists of words. For example, this list:

['email','user','this','email','address','customer']

is similar to this list:

['email','mail','address','netmail']

I want this pair to get a higher similarity percentage than a pair with an unrelated list, for example ['address','ip','network'], even though address appears in that list too.


Solution

  • Since you haven't shown a clear expected output, here is my best shot:

    list_A = ['email','user','this','email','address','customer']
    list_B = ['email','mail','address','netmail']
    

    For the two lists above, we compute the cosine similarity between each element of one list and every element of the other, e.g. email from list_B against every element of list_A. Each word is treated as a vector of character counts:

    from collections import Counter
    from math import sqrt
    
    def word2vec(word):
        # note: this builds a character-count vector, not a Word2Vec embedding
        # count the characters in the word
        cw = Counter(word)
        # precompute the set of distinct characters
        sw = set(cw)
        # precompute the Euclidean length (norm) of the count vector
        lw = sqrt(sum(c*c for c in cw.values()))
    
        # return a tuple: (counts, distinct characters, norm)
        return cw, sw, lw
    
    def cosdis(v1, v2):
        # which characters are common to the two words?
        common = v1[1].intersection(v2[1])
        # despite the function name, this is the cosine similarity:
        # dot product of the count vectors divided by the product of their norms
        return sum(v1[0][ch]*v2[0][ch] for ch in common)/v1[2]/v2[2]
    
    
    list_A = ['email','user','this','email','address','customer']
    list_B = ['email','mail','address','netmail']
    
    threshold = 0.80     # optional cutoff, used only by the commented-out check below
    for key in list_A:
        for word in list_B:
            res = cosdis(word2vec(word), word2vec(key))
            print("The cosine similarity between : {} and : {} is: {}".format(word, key, res*100))
            # if res > threshold:
            #     print("Found a word with cosine similarity > 80% : {} with original word: {}".format(word, key))
    

    OUTPUT:

    The cosine similarity between : email and : email is: 100.0
    The cosine similarity between : mail and : email is: 89.44271909999159
    The cosine similarity between : address and : email is: 26.967994498529684
    The cosine similarity between : netmail and : email is: 84.51542547285166
    The cosine similarity between : email and : user is: 22.360679774997898
    The cosine similarity between : mail and : user is: 0.0
    The cosine similarity between : address and : user is: 60.30226891555272
    The cosine similarity between : netmail and : user is: 18.89822365046136
    The cosine similarity between : email and : this is: 22.360679774997898
    The cosine similarity between : mail and : this is: 25.0
    The cosine similarity between : address and : this is: 30.15113445777636
    The cosine similarity between : netmail and : this is: 37.79644730092272
    The cosine similarity between : email and : email is: 100.0
    The cosine similarity between : mail and : email is: 89.44271909999159
    The cosine similarity between : address and : email is: 26.967994498529684
    The cosine similarity between : netmail and : email is: 84.51542547285166
    The cosine similarity between : email and : address is: 26.967994498529684
    The cosine similarity between : mail and : address is: 15.07556722888818
    The cosine similarity between : address and : address is: 100.0
    The cosine similarity between : netmail and : address is: 22.79211529192759
    The cosine similarity between : email and : customer is: 31.62277660168379
    The cosine similarity between : mail and : customer is: 17.677669529663685
    The cosine similarity between : address and : customer is: 42.640143271122085
    The cosine similarity between : netmail and : customer is: 40.08918628686365
    

    Note: the threshold check is left commented out in the code, in case you only want the word pairs whose similarity exceeds a certain threshold, e.g. 80%.
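If only the pairs above the threshold are needed, the commented-out check could be turned into a small helper along these lines. This is a minimal sketch, not part of the original answer: the names `char_cosine` and `similar_pairs` are made up for illustration, and the character-count cosine is re-implemented compactly so the block runs on its own.

```python
from collections import Counter
from math import sqrt

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']


def char_cosine(word1, word2):
    """Cosine similarity between the character-count vectors of two words."""
    c1, c2 = Counter(word1), Counter(word2)
    # only characters present in both words contribute to the dot product
    dot = sum(c1[ch] * c2[ch] for ch in c1.keys() & c2.keys())
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    # an empty word has norm 0; treat it as similar to nothing
    return dot / norm if norm else 0.0


def similar_pairs(words_a, words_b, threshold=0.80):
    """Return (word_b, word_a, score) for every pair scoring above threshold."""
    pairs = []
    for a in words_a:
        for b in words_b:
            score = char_cosine(b, a)
            if score > threshold:
                pairs.append((b, a, score))
    return pairs


for word, key, score in similar_pairs(list_A, list_B):
    print("{} ~ {}: {:.2f}%".format(word, key, score * 100))
```

Because email occurs twice in list_A, its matches are reported twice; deduplicate list_A first (for example with `dict.fromkeys(list_A)`) if that is not wanted.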

    EDIT:

    OP: but what I actually want is not a word-by-word comparison but a list-by-list one

    Using Counter and math, treating each list as a vector of word counts:

    from collections import Counter
    import math
    
    # word counts for each list (list_A and list_B as defined above)
    counterA = Counter(list_A)
    counterB = Counter(list_B)
    
    
    def counter_cosine_similarity(c1, c2):
        # every word that occurs in either list
        terms = set(c1).union(c2)
        dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
        magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
        magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
        # note: raises ZeroDivisionError if either list is empty
        return dotprod / (magA * magB)
    
    print(counter_cosine_similarity(counterA, counterB) * 100)
    

    OUTPUT:

    53.03300858899106
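
To check the requirement from the question, the same measure can be run against the unrelated list ['address','ip','network'] as well. The sketch below is an illustration under assumptions, not the answer's own code: `list_C`, `exact_similarity` and `soft_similarity` are hypothetical names, and the "soft" score (average of each word's best character-level match) is just one possible way to let mail and netmail count as close to email.

```python
from collections import Counter
from math import sqrt

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']
list_C = ['address', 'ip', 'network']  # the unrelated list from the question


def cosine(c1, c2):
    """Cosine similarity between two Counters; 0.0 if either is empty."""
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0


def exact_similarity(words1, words2):
    """Word-count cosine: only identical words contribute."""
    return cosine(Counter(words1), Counter(words2))


def soft_similarity(words1, words2):
    """Assumed aggregate: for each word in words2, take its best
    character-count cosine against words1, then average those bests.
    Note that this score is not symmetric in its two arguments."""
    if not words1 or not words2:
        return 0.0
    best = [max(cosine(Counter(w2), Counter(w1)) for w1 in words1) for w2 in words2]
    return sum(best) / len(best)


for name, other in (('list_B', list_B), ('list_C', list_C)):
    print("{}: exact {:.2f}%, soft {:.2f}%".format(
        name,
        exact_similarity(list_A, other) * 100,
        soft_similarity(list_A, other) * 100))
```

With exact word matching, list_B scores about 53.03% and list_C about 20.41%, so the shared word address alone does not make list_C look similar. The soft score also ranks list_B above list_C, while giving partial credit to near matches such as mail and email.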