Situation: I have a CVD (ClamAV Virus Database) file loaded into RAM using mmap. Every line in the CVD file has the same format as a line in a CSV file, except that the delimiter is ':'. Below is a snippet of the code:
def mapping():
    with open("main.cvd", 'rt') as f:
        global mapper
        mapper = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
        csv.register_dialect('delimit', delimiter=':', quoting=csv.QUOTE_NONE)

def compare(hashed):
    for row in csv.reader(mapper, dialect='delimit'):
        if row[1] == hashed:
            print('Found!')
Problem: When run, it raises the error _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
Question: How do I read, as text, a CSV file that has been loaded into memory?
Additional information 1: I have tried using StringIO, but it throws the error TypeError: initial_value must be str or None, not mmap.mmap
Additional information 2: I need the file to be in RAM for faster access, and I cannot afford the time it takes to read it line by line using functions such as readline()
The csvfile argument to the csv.reader constructor "can be any object which supports the iterator protocol and returns a string each time its next() method is called".
This means the "object" can be the generator produced by a generator function or a generator expression. In the code below I've implemented a generator function called mmap_file_reader() which converts the bytes in the memory map into character strings and yields each line it detects.
I made the mmap.mmap constructor call conditional so it would work on Windows, too. This shouldn't be necessary if you use the access= keyword instead of the prot= keyword, but I couldn't test that and so did it as shown.
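
For reference, here is a minimal sketch of the access= variant mentioned above. It assumes, as the mmap documentation describes, that access=mmap.ACCESS_READ is accepted on both Unix and Windows. The helper name open_readonly_map and the sample file contents are hypothetical and exist only for this illustration.

```python
import mmap
import os
import tempfile


def open_readonly_map(path):
    """Return a read-only memory map of the whole file at `path`.

    Hypothetical helper for illustration; not part of the original code.
    """
    # Binary mode: mmap only needs the file descriptor, and the map
    # always exposes bytes regardless of how the file was opened.
    with open(path, 'rb') as f:
        # Length 0 maps the entire file; ACCESS_READ stands in for prot=PROT_READ.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


# Small throwaway file standing in for main.cvd (made-up contents).
with tempfile.NamedTemporaryFile('wb', suffix='.cvd', delete=False) as tmp:
    tmp.write(b'name1:hash1:size1\nname2:hash2:size2\n')

mm = open_readonly_map(tmp.name)
first_line = mm.readline()  # bytes, not str, so it still needs decoding
mm.close()
os.remove(tmp.name)
```

The full listing that follows keeps the platform check as originally written.
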
import csv
import mmap
import sys

def mapping():
    with open("main.cvd", 'rt') as f:
        global mapper
        if sys.platform.startswith('win32'):
            mmf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # windows
        else:
            mmf = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)  # unix
        mapper = mmap_file_reader(mmf)
        csv.register_dialect('delimit', delimiter=':', quoting=csv.QUOTE_NONE)

def mmap_file_reader(mmf):
    '''Yield successive lines of the given memory-mapped file as strings.

    Generator function which reads and converts the bytes of the given mmapped file
    to strings and yields them one line at a time.
    '''
    while True:
        line = mmf.readline()
        if not line:  # EOF?
            return
        yield str(line, encoding='utf-8')  # convert bytes of line read into a string

def compare(hashed):
    for row in csv.reader(mapper, dialect='delimit'):
        if row[1] == hashed:
            print('Found!')
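
One thing to watch in the listing above: mapper is a generator, so it is exhausted after the first full pass, and a second call to compare() would iterate over nothing. A possible way around that, sketched here under the assumption that the signature file is UTF-8 text, is to rewind the map and build a fresh line iterator for every search. The names iter_lines and contains_hash and the sample data are hypothetical.

```python
import csv
import mmap
import os
import tempfile


def iter_lines(mm, encoding='utf-8'):
    """Yield each line of the memory map as a str, starting from the top."""
    mm.seek(0)  # rewind so every new iterator scans the whole file again
    # iter(callable, sentinel) calls mm.readline() until it returns b'' (EOF).
    for raw in iter(mm.readline, b''):
        yield raw.decode(encoding)


def contains_hash(mm, hashed):
    """Return True if any row has `hashed` in its second field."""
    reader = csv.reader(iter_lines(mm), delimiter=':', quoting=csv.QUOTE_NONE)
    # The length check skips rows that have no second field.
    return any(len(row) > 1 and row[1] == hashed for row in reader)


# Throwaway file standing in for main.cvd (made-up contents).
with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(b'sig1:aaa111:10\nsig2:bbb222:20\n')

with open(tmp.name, 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

first = contains_hash(mm, 'bbb222')
second = contains_hash(mm, 'bbb222')  # works again because of the rewind
missing = contains_hash(mm, 'zzz999')
mm.close()
os.remove(tmp.name)
```

Returning a boolean instead of printing also lets the search stop at the first match, which matters when the file is large.
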