I faced with a very odd issue in difflib library in python. I have two strings as follows and I run get_opcodes
on them like this:
import difflib
str1 = "MatrixElement(MatrixSymbol('Btd', Integer(11), Integer(11)), Integer(0), Integer(9))), Mul(Float('1.0', precision=24), MatrixElement(MatrixSymbol('Btd', Integer(11), Integer(11)), Integer(0), Integer(10))))"
str2 = "MatrixElement(MatrixSymbol('Btd', Integer(11), Integer(11)), Integer(1), Integer(9))), Mul(Float('1.0', precision=24), MatrixElement(MatrixSymbol('Btd', Integer(11), Integer(11)), Integer(1), Integer(10))))"
difflib.SequenceMatcher(None, str1,str2).get_opcodes()
Only in this specific example, the output of the diff is like the following, which is obviously wrong.
[('equal', 0, 69, 0, 69),
('replace', 69, 70, 69, 70),
('equal', 70, 188, 70, 188),
('insert', 188, 188, 188, 201),
('equal', 188, 190, 201, 203),
('replace', 190, 206, 203, 206)]
The correct output should not contain insert
opcode, as nothing new is added.
Is this potentially a bug in difflib?
This is not a bug. There are multiple ways to transform a sequence into another, the one difflib
outputs here is correct.
Although, you are right to wonder why difflib
chose that odd transformation instead of that one:
[('equal', 0, 69, 0, 69),
('replace', 69, 70, 69, 70),
('equal', 70, 188, 70, 188),
('replace', 188, 189, 188, 189),
('equal', 189, 206, 189, 206)]
It comes down to one thing: autojunk=True
junk
!The main algorithm behind generating the opcodes comes from SequenceMatcher.get_matching_blocks
, this method breaks down the provided sequences into matching subsequences.
To do so efficiently, it first parses str2
and builds a dict
where keys are characters of the sequence and values are lists of indices of the corresponding character.
Although, this can be very memory consumming and thus, by default, difflib.SequenceMatcher
will consider some reccuring characters as junk and not store their indices.
From the difflib
doc:
Automatic junk heuristic: [...] If an item’s duplicates (after the first one) account for more than 1% of the sequence and the sequence is at least 200 items long, this item is marked as “popular” and is treated as junk for the purpose of sequence matching. [...]
In your specific case, the culprit is the character (
which is treated as junk. The SequenceMatcher
object is unable to see a matching sequence starting at index 189 because it is a (
.
The simplest way to get the ouput you expected is to set autojunk=False
.
difflib.SequenceMatcher(None, str1, str2, autojunk=False).get_opcodes()
This outputs what you expected:
[('equal', 0, 69, 0, 69),
('replace', 69, 70, 69, 70),
('equal', 70, 188, 70, 188),
('replace', 188, 189, 188, 189),
('equal', 189, 206, 189, 206)]
Although, note that sometimes turning autojunk
off completely might not be the best option since it will likely consume more memory and time. A better approach would be to specify what is considered junk.
[...] these “junk” elements are ones that are uninteresting in some sense, such as blank lines or whitespace [...]
This is especially true when you are using difflib.ratio
to get the measure of similarity between sequences. In that case you might want to ignore whitespaces as they are generally uninteresting in term of text comparsion.
Thus if you turn off autojunk
, you can still provide a isjunk
function that will indicate to ignore, say, whitespaces. This argument is the one you set as None
in your example.
import difflib
from string import whitespace
...
difflib.SequenceMatcher(lambda x: x in whitespace, str1, str2, autojunk=False)