pythonfilebisectionbisect

Bisection search in the sorted lines of an opened file (not loaded in memory)


I have a text file of multiple gigabytes, and the millions of lines are sorted:

aaaaa
bcijkjf
dfsdf
gdfgdfhqiuzhf
zzdiszfhj

How is it possible, without loading the whole file in memory, to search if a line is existent, with a bisection search? (Possibly in O(log n) in the number of lines)

Is there a function like bisect.bisect_left among the lines of a file f = open('file.txt', 'r') in the Python library?

The window would initially be [a, b] = [0, file_size]. Then it would seek in the file at position m=(a+b)/2, look for the next \n, and read the following line l. If the pattern to search is smaller or greater than l (with lexicographic order), then we continue on [m, b] or [a, m]. Before rolling my own, does this exist in Python?


Solution

  • You can use the mmap built-in module. It provides random access to files (i.e., a file behaves like a large array of bytes stored in the file system). You can find more info here.

    import mmap
    
    def bisect_search(file_path, line):
        line = line.encode()
        with open(file_path, 'r+b') as f:
            mm = mmap.mmap(f.fileno(), 0)
            lo = 0
            hi = mm.size()
            while lo < hi:
                mid = (lo + hi) // 2
                left_endl_idx = mm.rfind(b'\n', lo, mid)
                right_endl_idx = mm.find(b'\n', mid, hi)
                if left_endl_idx == -1:
                    left_endl_idx = lo - 1
                if right_endl_idx == -1:
                    right_endl_idx = hi
                mid_line = mm[left_endl_idx + 1: right_endl_idx]
                if mid_line == line:
                    return True
                if mid_line < line:
                    lo = right_endl_idx + 1
                else:
                    hi = left_endl_idx
        return False
    

    The function returns True if line exists within the file, False otherwise. Let's use the following myfile.txt file to run a few examples:

    aaaaa
    bcijkjf
    dfsdf
    gdfgdfhqiuzhf
    zzdiszfhj
    
    >>> bisect_search('myfile.txt', 'hello')
    False
    >>> bisect_search('myfile.txt', 'aaaaa')
    True
    >>> bisect_search('myfile.txt', 'aaaa')
    False
    >>> bisect_search('myfile.txt', 'dfsdf')
    True
    >>> bisect_search('myfile.txt', 'zzdiszfhjj')
    False
    

    This function should be much faster than a linear search on a big file.

    Note: this code works with \n endings, and not currently with \r\n Windows-style endings (not necessary for OP).