python · regex · iterator · bytestream

Regular expression parsing (streaming) a binary file?


I'm trying to implement a strings(1)-like function in Python.

import re

def strings(f, n=5):
    # TODO: support files larger than available RAM
    return re.finditer(br'[!-~\s]{%i,}' % n, f.read())

if __name__ == '__main__':
    import sys
    with open(sys.argv[1], 'rb') as f:
        for m in strings(f):
            print(m[0].decode().replace('\x0A', '\u240A'))

Setting aside the case of actual matches* that are larger than the available RAM, the above code fails in the case of files that are merely, themselves, larger than the available RAM!

An attempt to naively "iterate over f" will proceed linewise, even for binary files; this may be inappropriate because (a) it may return different results than simply running the regex on the whole input, and (b) if the machine has 4 gigabytes of RAM and the file contains any match for rb'[^\n]{8589934592,}', then that unasked-for match will cause a memory problem anyway!
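To illustrate point (a) with a hypothetical 12-byte input: linewise iteration splits the stream at every b'\n', so a match that would span a newline (\s includes \n) is found on the whole input but not within any single line.

```python
import io
import re

# Hypothetical input: two printable runs separated by a newline.
data = b"hello\nworld!"

# Iterating a binary file object splits at every b'\n'.
lines = list(io.BytesIO(data))
print(lines)  # [b'hello\n', b'world!']

# A run of >= 7 printable-or-whitespace bytes exists in the whole
# input, but neither line alone contains one.
pattern = re.compile(br'[!-~\s]{7,}')
print([m[0] for m in pattern.finditer(data)])   # [b'hello\nworld!']
print([m[0] for line in lines
             for m in pattern.finditer(line)])  # []
```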

Does Python's regex library enable any simple way to stream re.finditer over a binary file?

*I am aware that it is possible to write regular expressions that may require an exponential amount of CPU or RAM relative to their input length. Handling those cases is, obviously, out of scope; I'm assuming for the purposes of the question that the machine at least has enough resources to handle the regex, its largest match on the input, acquiring that match, and skipping over all nonmatches.



Solution

  • Does Python's regex library enable any simple way to stream re.finditer over a binary file?

    Well, while typing up the question in such excruciating detail and gathering supporting documentation, I found the solution:

    mmap — Memory-mapped file support

    Memory-mapped file objects behave like both bytearray and like file objects. You can use mmap objects in most places where bytearray are expected; for example, you can use the re module to search through a memory-mapped file.

    Enacted:

    import re, mmap
    
    def strings(f, n=5):
        # Map the file read-only; the OS pages it in lazily on demand.
        view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        return re.finditer(br'[!-~\s]{%i,}' % n, view)
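    Plugged back into a driver, a minimal self-contained sketch looks like this (a temporary file stands in for the sys.argv[1] path from the question; note that mmap refuses to map an empty file):

```python
import re
import mmap
import tempfile

def strings(f, n=5):
    # Map the whole file read-only; pages are faulted in on demand,
    # so memory use is bounded by the matches, not by the file size.
    view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return re.finditer(br'[!-~\s]{%i,}' % n, view)

# Stand-in for `open(sys.argv[1], 'rb')`: a throwaway binary file.
with tempfile.TemporaryFile() as f:
    f.write(b'\x00\x7fELF\x00\x00/lib64/ld-linux.so.2\x00\x01')
    f.flush()
    for m in strings(f):
        print(m[0].decode().replace('\x0A', '\u240A'))
        # -> /lib64/ld-linux.so.2
```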
    

    Caveat: on 32-bit systems, this might not work for files larger than 2GiB, if the underlying standard library is deficient.
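    If you want to know which situation you're in, a quick hedge (not part of the original answer) is to check whether the running interpreter itself has a 64-bit address space:

```python
import sys
import struct

# A 64-bit Python reports a pointer-sized maxsize well above 2**32;
# struct.calcsize('P') gives the pointer width in bytes.
is_64bit = sys.maxsize > 2**32
print('64-bit interpreter:', is_64bit)
print('pointer size:', struct.calcsize('P') * 8, 'bits')
```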

    However, it looks like it should be fine on both Windows and any well-maintained Linux distribution:

    13.8 Memory-mapped I/O

    Since mmapped pages can be stored back to their file when physical memory is low, it is possible to mmap files orders of magnitude larger than both the physical memory and swap space. The only limit is address space. The theoretical limit is 4GB on a 32-bit machine - however, the actual limit will be smaller since some areas will be reserved for other purposes. If the LFS interface is used the file size on 32-bit systems is not limited to 2GB … the full 64-bit [8 EiB] are available. …

    Creating a File Mapping Using Large Pages

    … you must specify the FILE_MAP_LARGE_PAGES flag with the MapViewOfFile function to map large pages. …