pythondifflib

difflib cannot correctly find the opcodes


I faced with a very odd issue in difflib library in python. I have two strings as follows and I run get_opcodes on them like this:

import difflib

str1 = "MatrixElement(MatrixSymbol('Btd', Integer(11), Integer(11)), Integer(0), Integer(9))), Mul(Float('1.0', precision=24), MatrixElement(MatrixSymbol('Btd', Integer(11), Integer(11)), Integer(0), Integer(10))))"
str2 = "MatrixElement(MatrixSymbol('Btd', Integer(11), Integer(11)), Integer(1), Integer(9))), Mul(Float('1.0', precision=24), MatrixElement(MatrixSymbol('Btd', Integer(11), Integer(11)), Integer(1), Integer(10))))"
difflib.SequenceMatcher(None, str1,str2).get_opcodes()

Only in this specific example, the output of the diff is like the following, which is obviously wrong.

[('equal', 0, 69, 0, 69),
 ('replace', 69, 70, 69, 70),
 ('equal', 70, 188, 70, 188),
 ('insert', 188, 188, 188, 201),
 ('equal', 188, 190, 201, 203),
 ('replace', 190, 206, 203, 206)]

The correct output should not contain insert opcode, as nothing new is added.

Is this potentially a bug in difflib?


Solution

  • This is not a bug. There are multiple ways to transform a sequence into another, the one difflib outputs here is correct.

    Although, you are right to wonder why difflib chose that odd transformation instead of that one:

    [('equal', 0, 69, 0, 69),
     ('replace', 69, 70, 69, 70),
     ('equal', 70, 188, 70, 188),
     ('replace', 188, 189, 188, 189),
     ('equal', 189, 206, 189, 206)]
    

    It comes down to one thing: autojunk=True

    Prepare to learn about junk!

    The main algorithm behind generating the opcodes comes from SequenceMatcher.get_matching_blocks, this method breaks down the provided sequences into matching subsequences.

    To do so efficiently, it first parses str2 and builds a dict where keys are characters of the sequence and values are lists of indices of the corresponding character.

    Although, this can be very memory consumming and thus, by default, difflib.SequenceMatcher will consider some reccuring characters as junk and not store their indices.

    From the difflib doc:

    Automatic junk heuristic: [...] If an item’s duplicates (after the first one) account for more than 1% of the sequence and the sequence is at least 200 items long, this item is marked as “popular” and is treated as junk for the purpose of sequence matching. [...]

    In your specific case, the culprit is the character ( which is treated as junk. The SequenceMatcher object is unable to see a matching sequence starting at index 189 because it is a (.

    Handling junk

    The simplest way to get the ouput you expected is to set autojunk=False.

    difflib.SequenceMatcher(None, str1, str2, autojunk=False).get_opcodes()
    

    This outputs what you expected:

    [('equal', 0, 69, 0, 69),
     ('replace', 69, 70, 69, 70),
     ('equal', 70, 188, 70, 188),
     ('replace', 188, 189, 188, 189),
     ('equal', 189, 206, 189, 206)]
    

    Although, note that sometimes turning autojunk off completely might not be the best option since it will likely consume more memory and time. A better approach would be to specify what is considered junk.

    [...] these “junk” elements are ones that are uninteresting in some sense, such as blank lines or whitespace [...]

    This is especially true when you are using difflib.ratio to get the measure of similarity between sequences. In that case you might want to ignore whitespaces as they are generally uninteresting in term of text comparsion.

    Thus if you turn off autojunk, you can still provide a isjunk function that will indicate to ignore, say, whitespaces. This argument is the one you set as None in your example.

    import difflib
    from string import whitespace
    
    ...
    
    difflib.SequenceMatcher(lambda x: x in whitespace, str1, str2, autojunk=False)