I have a text file of multiple gigabytes, and the millions of lines are sorted:
aaaaa
bcijkjf
dfsdf
gdfgdfhqiuzhf
zzdiszfhj
How is it possible, without loading the whole file in memory, to search if a line is existent, with a bisection search? (Possibly in O(log n) in the number of lines)
Is there a function like bisect.bisect_left
among the lines of a file f = open('file.txt', 'r')
in the Python library?
The window would initially be [a, b] = [0, file_size]
. Then it would seek in the file at position m=(a+b)/2
, look for the next \n
, and read the following line l
. If the pattern to search is smaller or greater than l
(with lexicographic order), then we continue on [m, b]
or [a, m]
. Before rolling my own, does this exist in Python?
You can use the mmap
built-in module. It provides random access to files (i.e., a file behaves like a large array of bytes stored in the file system). You can find more info here.
import mmap
def bisect_search(file_path, line):
line = line.encode()
with open(file_path, 'r+b') as f:
mm = mmap.mmap(f.fileno(), 0)
lo = 0
hi = mm.size()
while lo < hi:
mid = (lo + hi) // 2
left_endl_idx = mm.rfind(b'\n', lo, mid)
right_endl_idx = mm.find(b'\n', mid, hi)
if left_endl_idx == -1:
left_endl_idx = lo - 1
if right_endl_idx == -1:
right_endl_idx = hi
mid_line = mm[left_endl_idx + 1: right_endl_idx]
if mid_line == line:
return True
if mid_line < line:
lo = right_endl_idx + 1
else:
hi = left_endl_idx
return False
The function returns True
if line
exists within the file, False
otherwise. Let's use the following myfile.txt
file to run a few examples:
aaaaa
bcijkjf
dfsdf
gdfgdfhqiuzhf
zzdiszfhj
>>> bisect_search('myfile.txt', 'hello')
False
>>> bisect_search('myfile.txt', 'aaaaa')
True
>>> bisect_search('myfile.txt', 'aaaa')
False
>>> bisect_search('myfile.txt', 'dfsdf')
True
>>> bisect_search('myfile.txt', 'zzdiszfhjj')
False
This function should be much faster than a linear search on a big file.
Note: this code works with \n
endings, and not currently with \r\n
Windows-style endings (not necessary for OP).