This is a question arising from a Coursera course on genomics that I'm following. We are currently dealing with search methods, specifically how to detect (align) all occurrences of a shorter DNA sequence (p, a string) within a longer DNA sequence (t, also a string). The approach in question involves the creation of a hash table for t composed of all possible substrings of length k AND their positions (each substring is a so-called kmer). We then search the hash table with substrings of the same length (kmers) derived from p. This involves the creation of an Index class and the eventual use of Python's bisect_left function. Here's the code:
import bisect
import sys
class Index(object):
def __init__(self, t, k):
''' Create index from all substrings of size 'length' '''
self.k = k # k-mer length (k)
self.index = []
for i in range(len(t) - k + 1): # for each k-mer
self.index.append((t[i:i+k], i)) # append (k-mer, offset) pair (tuple)
self.index.sort() # alphabetize by k-mer
def query(self, p):
''' Return index hits for first k-mer of P '''
kmer = p[:self.k] # query with first k-mer
i = bisect.bisect_left(self.index, (kmer, -1)) # binary search
hits = []
while i < len(self.index): # collect matching index entries
if self.index[i][0] != kmer:
break
hits.append(self.index[i][1])
i += 1
return hits
My question concerns the format of the bisect_left function. Python docs gives the syntax for this function as follows:
bisect.bisect_left(a, x, lo=0, hi=len(a), *, key=None)
The bisect_left function used in my course is formatted in this way:
i = bisect.bisect_left(self.index, (kmer, -1))
I understand that the parameter a has been passed the argument self.index, which is the hash table created from t. But then I'm flummoxed by the tuple which follows. I don't see how any of the following parameters accepts a tuple. I can imagine that the hash table is organized as tuples of the form (kmer, position), but none of these entries will have the position -1. The instructor states that he sets the value to -1 for a specific purpose as follows:
"So the first step is to just find the first k basis of p to look them up in the tables. So, I'm going to say kmer equals p up to length k, and we need to do self.k. And now I want to find the first position in the list where this kmer occurs, so I'm going to use bisect to do that quickly. And say i = bisect.bisect_left. And I pass in the list that we're searching for, and then the object that we're searching for. And remember this is a list of tuples so I'm going to pass in the kmer, and then I'm going to put -1. And since all the indices in the list are greater than negative one, this assures that we always get the first occurrence of that."
I mean, I understand the English, but I don't see how bisect can (a) take in a tuple as an argument, and (b) assuming it can (obviously it can), how can it make use of a position, -1, that doesn't exist in the hash table? I have a hunch I'm not cluing in to what's going on in terms of sorting, but I can't put my finger on it.
Many thanks, and sorry for what must be the overly long statement of my question.
In Python, you can compare tuples lexicographically (like in a dictionary). First, the first elements are compared, then the second elements, and so on.
>>> (1, 2, 3) < (1, 2, 4)
True
>>> (1, 3, 3) < (1, 2, 4)
False
>>> ('a', 'p', 'p', 'l', 'e') > ('a', 'r', 't')
False
>>>
bisect.bisect_left
is not restricted to integers or strings, and it works with any objects that can be compared. In fact, you can see how it's implemented here: GitHub link
As the documentation on bisect_left says, it searches for the index at which to insert the new tuple so that the list is still sorted.
For instance, this list is sorted:
[("a", 1), ("b", -42), ("c", 2), ("c", 3)]
And this is not:
[("a", 1), ("c", 3), ("b", -42), ("c", 3)]
Example of using bisect.bisect_left
with tuples:
from bisect import bisect_left
# The shopping list must be ordered by food type first, by
# quantity second.
shopping_list = [("apples", 100), ("bananas", 42), ("grapes", 250)]
# Say, we want to figure out where to put our cherries.
print(bisect_left(shopping_list, ("cherries", 666))) # 2
# Our new list might look like this:
[("apples", 100), ("bananas", 42), ("cherries", 666), ("grapes", 250)]
# (just for illustration, bisect_left doesn't change the list)
# What if we want to find out where the bananas start?
print(bisect_left(shopping_list, ("bananas",))) # 1
Thanks to Kelly Bundy's comment: (kmer,)
is a better value to use than (kmer, -1)
. ("foo",)
is indeed less than ("foo", whatever)
. This is better if your requirements change and now -1
is a valid second element.
P.S. The self.index
list is not a hash table. It's simply a sorted list of tuples. It doesn't have any special searching capabilities (besides allowing binary search, which is what you're doing). There is a great talk by Brandon Rhodes and this SO question for an explanation of hash tables.