python string colors nlp difference

How to highlight the differences between two strings in Python?


I want to highlight the differences between two strings in colour using Python.

Example 1:

sentence1 = "I'm enjoying the summer breeze on the beach while I do some pilates."
sentence2 = "I am enjoying the summer breeze on the beach while I am doing some pilates."

Expected result (the part marked by asterisks should be in red):

 I *am* enjoying the summer breeze on the beach while I *am doing* some pilates.

Example 2:

sentence1 = "My favourite season is Autumn while my sister's favourite season is Winter."
sentence2 = "My favourite season is Autumn, while my sister's favourite season is Winter."

Expected result (the comma is different):

"My favourite season is Autumn*,* while my sister's favourite season is Winter." 

I tried this:

sentence1 = "I'm enjoying the summer breeze on the beach while I do some pilates."
sentence2 = "I'm enjoying the summer breeze on the beach while I am doing some pilates."

# Split the sentences into words
words1 = sentence1.split()
words2 = sentence2.split()

# Find the index where the sentences differ
index_of_difference = next((i for i, (word1, word2) in enumerate(zip(words1, words2)) if word1 != word2), None)

# Highlight differing part "am doing" in red
highlighted_words = []
for i, (word1, word2) in enumerate(zip(words1, words2)):
    if i == index_of_difference:
        highlighted_words.append('\033[91m' + word2 + '\033[0m')
    else:
        highlighted_words.append(word2)

highlighted_sentence = ' '.join(highlighted_words)
print(highlighted_sentence)

And I got this:

I'm enjoying the summer breeze on the beach while I *am* doing some

Instead of this:

I'm enjoying the summer breeze on the beach while I *am doing* some pilates.

How can I solve this?


Solution

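  • I believe the main issue with your code is how it finds the indexes of the differences: next(...) returns only the first position where the word lists differ, so at most one word is highlighted, and zip() stops at the end of the shorter list, so the last word of the longer sentence is dropped. Here is a solution that uses the difflib module from the Python standard library: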

    from difflib import Differ
    
    # Return string with the escape sequences at specific indexes to highlight
    def highlight_string_at_idxs(string, indexes):
        # hl = "\x1b[38;5;160m"  # 8-bit
        hl = "\x1b[91m"
        reset = "\x1b[0m"
        words_with_hl = []
        for string_idx, word in enumerate(string.split(" ")):
            if string_idx in indexes:
                words_with_hl.append(hl + word + reset)
            else:
                words_with_hl.append(word)
        return " ".join(words_with_hl)
    
    # Return indexes of the additions to s2 compared to s1
    def get_indexes_of_additions(s1, s2):
        diffs = list(Differ().compare(s1.split(" "), s2.split(" ")))
        indexes = []
        adj_idx = 0  # Count diff lines that are not words of s2 ("-" removals, "?" hints)
        for diff_idx, diff in enumerate(diffs):
            if diff[:1] == "+":
                indexes.append(diff_idx - adj_idx)
            elif diff[:1] in ("-", "?"):
                adj_idx += 1
        return indexes
    
    sentence1 = "I'm enjoying the summer breeze on the beach while I do some pilates."
    sentence2 = "I am enjoying the summer breeze on the beach while I am doing some pilates."
    addition_idxs = get_indexes_of_additions(sentence1, sentence2)
    hl_sentence2 = highlight_string_at_idxs(sentence2, addition_idxs)
    print(hl_sentence2)
    

    Output

    *I am* enjoying the summer breeze on the beach while I *am doing* some pilates.
    

    [Image: highlighted differences example]
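    The word-based approach above highlights whole words, so in Example 2 it would mark "Autumn," rather than only the comma. A possible character-level variant, offered as a minimal sketch rather than a definitive implementation, uses difflib.SequenceMatcher.get_opcodes() on the raw strings. The function name highlight_char_additions and its start/end parameters are made up for this illustration.

```python
from difflib import SequenceMatcher

RED = "\x1b[91m"   # ANSI escape: bright red foreground
RESET = "\x1b[0m"  # ANSI escape: reset all attributes


def highlight_char_additions(s1, s2, start=RED, end=RESET):
    """Return s2 with the characters that are new relative to s1 wrapped in start/end."""
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    pieces = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        chunk = s2[j1:j2]
        if tag in ("insert", "replace"):
            # These characters exist only in s2, so wrap them
            pieces.append(start + chunk + end)
        else:
            # "equal" keeps the text as-is; "delete" yields an empty chunk
            pieces.append(chunk)
    return "".join(pieces)


if __name__ == "__main__":
    sentence1 = "My favourite season is Autumn while my sister's favourite season is Winter."
    sentence2 = "My favourite season is Autumn, while my sister's favourite season is Winter."
    print(highlight_char_additions(sentence1, sentence2))
```

    Passing start="*" and end="*" instead of the escape codes gives the asterisk form from the question: My favourite season is Autumn*,* while my sister's favourite season is Winter. Character-level matching can split words into small fragments when the sentences differ a lot, so the word-level version is probably the better fit for Example 1.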