Tags: python, nltk, spacy, similarity, sentence-similarity

What is the best way to get accurate text similarity in Python for comparing single words or bigrams?


I've got similar product data in both the products_a array and products_b array:

products_a = [{'color': "White", 'size': "2' 3\""}, {'color': "Blue", 'size': "5' 8\""}]
products_b = [{'color': "Black", 'size': "2' 3\""}, {'color': "Sky blue", 'size': "5' 8\""}]

I would like to be able to accurately tell similarity between the colors in the two arrays, with a score between 0 and 1. For example, comparing "Blue" against "Sky blue" should be scored near 1.00 (probably like 0.78 or similar).

Spacy Similarity

I tried using spacy to solve this:

import spacy
nlp = spacy.load('en_core_web_sm')

def similarityscore(text1, text2):
    doc1 = nlp(text1)
    doc2 = nlp(text2)
    similarity = doc1.similarity(doc2)
    return similarity

Yeah, well, when passing in "Blue" against "Sky blue" it scores 0.6545742918773636. Ok, but what happens when passing in "White" against "Black"? The score is 0.8176945362451089... spacy is saying "White" and "Black" are ~82% similar! This is a failure when trying to make sure product colors are not similar.

Jaccard Similarity

I tried Jaccard Similarity on "White" against "Black" using the code below and got a score of 0.0 (maybe overkill for single words, but it leaves room for larger corpora later):

# remove punctuation and lowercase all words function
def simplify_text(text):
    for punctuation in ['.', ',', '!', '?', '"']:
        text = text.replace(punctuation, '')
    return text.lower()

# Jaccard function
def jaccardSimilarity(text_a, text_b):
    word_set_a, word_set_b = [set(simplify_text(text).split())
                              for text in [text_a, text_b]]
    num_shared = len(word_set_a & word_set_b)
    num_total = len(word_set_a | word_set_b)
    jaccard = num_shared / num_total
    return jaccard

Getting scores as different as 0.0 and 0.8176945362451089 for "White" against "Black" is not acceptable to me. I keep looking for a more accurate way to solve this; even taking the mean of the two would not be accurate. Please let me know if you have any better ways.
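For reference, here is what the Jaccard function above returns on the example colors once the stray `self.` is removed (a minimal, self-contained sketch of the two functions):

```python
# Minimal sketch of the Jaccard approach above, with the stray `self.` removed.
def simplify_text(text):
    # remove punctuation and lowercase all words
    for punctuation in ['.', ',', '!', '?', '"']:
        text = text.replace(punctuation, '')
    return text.lower()

def jaccardSimilarity(text_a, text_b):
    word_set_a, word_set_b = [set(simplify_text(text).split())
                              for text in [text_a, text_b]]
    return len(word_set_a & word_set_b) / len(word_set_a | word_set_b)

print(jaccardSimilarity("Blue", "Sky blue"))  # 0.5 (1 shared word of 2 unique)
print(jaccardSimilarity("White", "Black"))    # 0.0 (no shared words)
```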


Solution

  • NLP packages may be better at longer text fragments and more sophisticated text analysis.

    As you've discovered with 'black' and 'white', they make assumptions about similarity that are not right in the context of a simple list of products.

    Instead you can see this not as an NLP problem, but as a data transformation problem. This is how I would tackle it.

    To get the unique list of colors across both lists, use set operations on the colors found in the two product lists: a set comprehension gets a unique set of colors from each product list, then union() merges the two sets into the unique colors from both lists, with no duplicates. (Not really needed for 4 products, but very useful for 400, or 4,000.)

    products_a = [{'color': "White", 'size': "2' 3\""}, {'color': "Blue", 'size': "5' 8\""} ]
    products_b = [{'color': "Black", 'size': "2' 3\""}, {'color': "Sky blue", 'size': "5' 8\""} ]
    
    products_a_colors = {product['color'].lower() for product in products_a}
    products_b_colors = {product['color'].lower() for product in products_b}
    unique_colors = products_a_colors.union(products_b_colors)
    print(unique_colors)
    

    The colors are lowercased because in Python 'Blue' != 'blue', and capitalization varies across product data, so all comparisons are done in lowercase.

    The above code finds these unique colors:

    {'black', 'white', 'sky blue', 'blue'}
    

    The next step is to build an empty color map.

    colormap = {color: '' for color in unique_colors}
    import pprint
    pp = pprint.PrettyPrinter(indent=4, width=10, sort_dicts=True)
    pp.pprint(colormap)
    

    Result:

    {   'black': '',
        'blue': '',
        'sky blue': '',
        'white': ''}
    

    Paste the empty map into your code and fill out mappings for your complex colors like 'Sky blue'. Delete simple colors like 'white', 'black' and 'blue'. You'll see why below.

    Here's an example, assuming a slightly bigger range of products with more complex or unusual colors:

    colormap = {
        'sky blue': 'blue',
        'dark blue': 'blue',
        'bright red': 'red',
        'dark red': 'red',
        'burgundy': 'red'
    }
    

    This function helps you to group together colors that are similar based on your color map. Function color() maps complex colors onto base colors and drops everything into lower case to allow 'Blue' to be considered the same as 'blue'. (NOTE: the colormap dictionary should only use lowercase in its keys.)

    def color(product_color):
        return colormap.get(product_color.lower(), product_color).lower()
    

    Examples:

    >>> color('Burgundy')
    'red'
    >>> color('Sky blue')
    'blue'
    >>> color('Blue')
    'blue'
    

    If a color doesn't have a key in the colormap, it passes through unchanged, except that it is converted to lowercase:

    >>> color('Red')
    'red'
    >>> color('Turquoise')
    'turquoise'
    

    This is the scoring part. The product function from the standard library is used to pair items from product_a with items from product_b. Each pair is numbered using enumerate() because, as will become clear later, a score for a pair is of the form (pair_id, score). This way each pair can have more than one score.

    'cartesian product' is just a mathematical name for what itertools.product() does. I've renamed it to avoid confusion with product_a and product_b. itertools.product() returns all possible pairs between two lists.

    from itertools import product as cartesian_product
    product_pairs = {
        pair_id: product_pair for pair_id, product_pair
        in enumerate(cartesian_product(products_a, products_b))
    }
    print(product_pairs)
    

    Result:

    {0: ({'color': 'White', 'size': '2\' 3"'}, {'color': 'Black', 'size': '2\' 3"'}),
     1: ({'color': 'White', 'size': '2\' 3"'}, {'color': 'Sky blue', 'size': '5\' 8"'}),
     2: ({'color': 'Blue', 'size': '5\' 8"'}, {'color': 'Black', 'size': '2\' 3"'}),
     3: ({'color': 'Blue', 'size': '5\' 8"'}, {'color': 'Sky blue', 'size': '5\' 8"'})
    }
    

    The list will be much longer if you have 100s of products.

    Then here's how you might compile color scores:

    color_scores = [(pair_id, 0.8) for pair_id, (product_a, product_b)
                    in product_pairs.items()
                    if color(product_a['color']) == color(product_b['color'])]
    print(color_scores)
    

    In the example data, one product pair matches via the color() function: pair number 3, with the 'Blue' product from products_a and the 'Sky blue' product from products_b. Because color() evaluates both 'Sky blue' and 'Blue' to 'blue', this pair is awarded a score of 0.8:

    [(3, 0.8)]
    

    "deep unpacking" is used to extract product details and the "pair id" of the current product pair, and put them in local variables for processing or display. There's a nice tutorial article about "deep unpacking" here.

    The above is a blueprint for other rules. For example, you could write a rule based on size, and give that a different score, say, 0.5:

    size_scores = [(pair_id, 0.5) for pair_id, (product_a, product_b)
                   in product_pairs.items()
                   if product_a['size'] == product_b['size']]
    print(size_scores)
    
    

    and here are the resulting scores based on the 'size' attribute.

    [(0, 0.5), (3, 0.5)]
    

    This means pair 0 scores 0.5 and pair 3 scores 0.5 because their sizes match exactly.

    To get the total score for a product pair you might average the color and size scores:

    import itertools

    print()
    print("Totals")
    score_sources = [color_scores, size_scores]  # add more score lists here
    all_scores = sorted(itertools.chain(*score_sources))
    pair_scores = itertools.groupby(all_scores, lambda x: x[0])
    for pair_id, pairs in pair_scores:
        scores = [score for _, score in pairs]
        average = sum(scores) / len(scores)
        print(f"Pair {pair_id}: score {average}")
        for n, product in enumerate(product_pairs[pair_id]):
            print(f"  --> Item {n+1}: {product}")
    

    Results:

    Totals
    Pair 0: score 0.5
      --> Item 1: {'color': 'White', 'size': '2\' 3"'}
      --> Item 2: {'color': 'Black', 'size': '2\' 3"'}
    Pair 3: score 0.65
      --> Item 1: {'color': 'Blue', 'size': '5\' 8"'}
      --> Item 2: {'color': 'Sky blue', 'size': '5\' 8"'}
    

    Pair 3, which matches colors and sizes, has the highest score and pair 0, which matches on size only, scores lower. The other two pairs have no score.
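    For convenience, the whole approach can be condensed into one self-contained script (same data and the same 0.8/0.5 weights as above; the colormap is trimmed to the one complex color present in this data):

```python
from itertools import product as cartesian_product, chain, groupby

products_a = [{'color': "White", 'size': "2' 3\""}, {'color': "Blue", 'size': "5' 8\""}]
products_b = [{'color': "Black", 'size': "2' 3\""}, {'color': "Sky blue", 'size': "5' 8\""}]

colormap = {'sky blue': 'blue'}  # complex color -> base color (lowercase keys)

def color(product_color):
    # map complex colors onto base colors; unknown colors pass through lowercased
    return colormap.get(product_color.lower(), product_color).lower()

# Number every cross-list pair so scores can refer back to a pair id.
product_pairs = dict(enumerate(cartesian_product(products_a, products_b)))

# One rule per attribute; each rule awards (pair_id, score) tuples.
color_scores = [(pair_id, 0.8) for pair_id, (a, b) in product_pairs.items()
                if color(a['color']) == color(b['color'])]
size_scores = [(pair_id, 0.5) for pair_id, (a, b) in product_pairs.items()
               if a['size'] == b['size']]

# Average all scores awarded to each pair.
totals = {}
for pair_id, group in groupby(sorted(chain(color_scores, size_scores)),
                              lambda pair: pair[0]):
    scores = [score for _, score in group]
    totals[pair_id] = sum(scores) / len(scores)

print(totals)  # {0: 0.5, 3: 0.65}
```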