Reading CSV files from RAM


Situation: I have a CVD (ClamAV Virus Database) file loaded into RAM using mmap. Every line in the CVD file has the same format as a line in a CSV file, except that the delimiter is ':'. Below is a snippet of the code:

def mapping():
    with open("main.cvd", 'rt') as f:
        global mapper
        mapper = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
        csv.register_dialect('delimit', delimiter=':', quoting=csv.QUOTE_NONE)

def compare(hashed):
    for row in csv.reader(mapper, dialect='delimit'):
        if row[1] == hashed:
            print('Found!')

Problem: When run, it raises the error _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

Question: How do I read a CSV file as text once it has been loaded into memory?

Additional information 1: I have tried using StringIO, but it throws the error TypeError: initial_value must be str or None, not mmap.mmap

Additional information 2: I need the file to be in RAM for faster access, and I cannot sacrifice time reading it from disk line by line using functions such as readline()


Solution

  • The csvfile argument to the csv.reader constructor "can be any object which supports the iterator protocol and returns a string each time its next() method is called".

    This means the "object" can be a generator function or a generator expression. In the code below I've implemented a generator function called mmap_file_reader(), which converts the bytes in the memory map into character strings and yields each line it detects.

    I made the mmap.mmap constructor call conditional so it would work on Windows, too. This shouldn't be necessary if you use the access= keyword instead of the prot= keyword, but I couldn't test that, so I did it as shown.

    import csv
    import mmap
    import sys
    
    def mapping():
        with open("main.cvd", 'rb') as f:  # binary mode; mmap only needs the file descriptor
            global mapper
            if sys.platform.startswith('win32'):
                mmf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # windows
            else:
                mmf = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)  # unix
            mapper = mmap_file_reader(mmf)
            csv.register_dialect('delimit', delimiter=':', quoting=csv.QUOTE_NONE)
    
    def mmap_file_reader(mmf):
        '''Yield successive lines of the given memory-mapped file as strings.
    
        Generator function which reads and converts the bytes of the given mmapped file
        to strings and yields them one line at a time.
        '''
        while True:
            line = mmf.readline()
            if not line:  # EOF?
                return
            yield str(line, encoding='utf-8')  # convert the bytes of the line just read into a string
    
    def compare(hashed):
        for row in csv.reader(mapper, dialect='delimit'):
            if row[1] == hashed:
                print('Found!')
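
One thing to watch for in the code above: mapper is a generator, so it is exhausted after the first call to compare(), and a second call would find nothing. Below is a minimal sketch of a variant that rewinds the map before every search and uses access=mmap.ACCESS_READ, which mmap accepts on both Windows and Unix. The names open_mapping, iter_text_lines and contains_hash are hypothetical helpers for illustration, and the sketch assumes the file is non-empty, UTF-8 decodable text with the hash in the second field.

```python
import csv
import mmap


def open_mapping(path):
    """Memory-map the file at `path` read-only and return the mmap object."""
    with open(path, "rb") as f:
        # access=ACCESS_READ works on both Windows and Unix, so no
        # platform check is needed. Mapping an empty file raises ValueError.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def iter_text_lines(mmf, encoding="utf-8"):
    """Yield every line of the mapped file as a str, starting from the top."""
    mmf.seek(0)  # rewind so each new iteration scans the whole file
    # iter(callable, sentinel) stops when readline() returns b"" at EOF
    for raw in iter(mmf.readline, b""):
        yield raw.decode(encoding)


def contains_hash(mmf, hashed):
    """Return True if any row has `hashed` in its second field."""
    reader = csv.reader(iter_text_lines(mmf), delimiter=":",
                        quoting=csv.QUOTE_NONE)
    # len(row) > 1 guards against blank or short lines
    return any(len(row) > 1 and row[1] == hashed for row in reader)
```

Usage would look like mmf = open_mapping("main.cvd") once at startup, followed by any number of contains_hash(mmf, some_hash) calls. Each call still decodes the file line by line, but the bytes come from the memory map rather than from repeated disk reads.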