pythonbiopythondifflibrosalind

Strange behaviour of Python difflib library for sequence matcher


I am somewhat puzzled by a strange behaviour in the difflib library. I try to find overlapping sequences in strings (actually Fasta sequences from a Rosalind task) to glue them together. The code adapted from here works well with a smaller length of the string (for the sake of the clarity, I construct here an example from a common substring a):

import difflib

def glue(seq1, seq2):     
    s = difflib.SequenceMatcher(None, seq1, seq2)
    start1, start2, overlap = s.find_longest_match(0, len(seq1), 0, len(seq2)) 
    if start1 + overlap == len(seq1) and start2 == 0:
        return seq1 + seq2[overlap:]
    #no overlap found, return -1
    return -1


a = "ABCDEFG"
s1 = "XXX" + a
s2 = a + "YYY"

print(glue(s1, s2))

Output

XXXABCDEFGYYY

But when the string is longer, difflib doesn't find the match any more.

a = "AGGTGTGCCTGTGTCTATACATCGTACGCGGGAAGGTCCAAGTTAACATGGGGTACTGTAATGCACACGTACGCGGGAAGGTCCAAGTTAACTACGAAACGCGAGCCCATCTTTGCCGGTGTTAACTTGCTGTCAGGTGTTTGGCAAGGATCTTTGTTTGCCGGTGTTAACTTGCTGTCAGGTGTTTGGCCGGTGTTAACTTGCTGTCAGATGCGCGCCACGGCCAAATTCTAGGCACGCCAAATTCTAGGCACTTTAAGTGGTTCGATGATCCACGATGGTAAGCCAGCCGTACTTGC"

s1 = "XXX" + a
s2 = a + "YYY"

print(glue(s1, s2))

Output

-1

Why does this happen and how can you use difflib with longer strings?


Solution

  • I noticed that this behaviour starts, when the "ATCG" string is longer than 200. Looking for this number on the difflib documentation page brought me to this passage:

    Automatic junk heuristic: SequenceMatcher supports a heuristic that automatically treats certain sequence items as junk. The heuristic counts how many times each individual item appears in the sequence. If an item’s duplicates (after the first one) account for more than 1% of the sequence and the sequence is at least 200 items long, this item is marked as “popular” and is treated as junk for the purpose of sequence matching. This heuristic can be turned off by setting the autojunk argument to False when creating the SequenceMatcher.

    Since the sequence only contains ATCG, they are of course more than 1% of a sequence of 200 characters and hence considered junk. Changing the code to

    import difflib
    
    def glue(seq1, seq2):     
        s = difflib.SequenceMatcher(None, seq1, seq2, autojunk = False)
        start1, start2, overlap = s.find_longest_match(0, len(seq1), 0, len(seq2)) 
        if start1 + overlap == len(seq1) and start2 == 0:
            return seq1 + seq2[overlap:]
        return -1
    

    gets rid of this behaviour and allows for substring matching without restriction.
    I thought I leave this here on SO, because I wasted hours combing through my code to find this problem.