I have data stored in either a collection of files or in a single compound file. The compound file is formed by concatenating all the separate files, and then preceding everything with a header that gives the offsets and sizes of the constituent parts. I'd like to have a file-like object that presents a view of the compound file, where the view represents just one of the member files. (That way, I can have functions for reading the data that accept either a real file object or a "view" object, and they needn't worry about how any particular dataset is stored.) What library will do this for me?
The mmap class looked promising since it's constructed from a file, a length, and an offset, which is exactly what I have, but the offset needs to be aligned with the underlying file system's allocation granularity, and the files I'm reading don't meet that requirement. The name of the MultiFile class fits the bill, but it's tailored for attachments in e-mail messages, and my files don't have that structure.
The file operations I'm most interested in are read, seek, and tell. The files I'm reading are binary, so the text-oriented functions like readline and next aren't so crucial. I might eventually also need write, but I'm willing to forgo that feature for now since I'm not sure how appending should behave.
I know you were searching for a library, but as soon as I read this question I thought I'd write my own. So here it is:
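import os

class View:
    """File-like read-only view of `length` bytes of `f`, starting at `offset`."""

    def __init__(self, f, offset, length):
        self.f = f
        self.f_offset = offset  # where the view starts in the underlying file
        self.offset = 0         # current position, relative to the view
        self.length = length

    def seek(self, offset, whence=os.SEEK_SET):
        if whence == os.SEEK_SET:
            self.offset = offset
        elif whence == os.SEEK_CUR:
            self.offset += offset
        elif whence == os.SEEK_END:
            self.offset = self.length + offset
        else:
            # Other values of whence are invalid for a view
            raise IOError("invalid whence value: %r" % (whence,))
        # Move the underlying file to the matching absolute position
        self.f.seek(self.offset + self.f_offset, os.SEEK_SET)
        return self.offset

    def tell(self):
        return self.offset

    def read(self, size=-1):
        # Another view may have moved the shared file, so reposition first
        self.seek(self.offset)
        remaining = self.length - self.offset
        if size < 0:
            size = remaining
        size = max(0, min(size, remaining))
        data = self.f.read(size)
        # Advance by what was actually read, in case the file was short
        self.offset += len(data)
        return data
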
if __name__ == "__main__":
    import collections

    # Binary mode, so offsets are byte positions
    f = open('test.txt', 'rb')
    views = []
    offsets = [i*11 for i in range(10)]
    for o in offsets:
        # Each record is '.', a one-digit length, then the payload
        f.seek(o+1)
        length = int(f.read(1))
        views.append(View(f, o+2, length))
    f.seek(0)

    # Read every view in full, then rewind it
    completes = {}
    for v in views:
        completes[v.f_offset] = v.read()
        v.seek(0)

    # Read the views again in interleaved 3-byte chunks
    strs = collections.defaultdict(bytes)
    for i in range(3):
        for v in views:
            strs[v.f_offset] += v.read(3)
    strs = dict(strs) # We want it to raise KeyErrors after that.

    for offset, s in completes.items():
        print(offset, strs[offset], completes[offset])
        assert strs[offset] == completes[offset], "Something went wrong!"
    f.close()

And I wrote another script to generate the "test.txt" file:
import string, random

with open('test.txt', 'w') as f:
    for i in range(10):
        rand_list = list(string.ascii_letters)
        random.shuffle(rand_list)
        rand_str = "".join(rand_list[:9])
        # Record layout: '.', a one-digit length, then 9 random letters
        f.write(".%d%s" % (len(rand_str), rand_str))

It worked for me. The files I tested on are not binary files like yours, and they're not as big as yours, but this might be useful, I hope. If not, then thank you, that was a good challenge :D
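Since your files are binary, here is a rough Python 3 sketch of the same idea, exercised against an in-memory binary file. The header layout (a little-endian count followed by absolute offset/size pairs) is purely my assumption for illustration, and the names FileView, build_compound, and open_views are made up; your real header will certainly differ. Subclassing io.RawIOBase means I only have to write readinto, and read is built on top of it.

```python
import io
import struct


class FileView(io.RawIOBase):
    """Read-only window onto `length` bytes of `f`, starting at `offset`."""

    def __init__(self, f, offset, length):
        super().__init__()
        self._f = f
        self._start = offset
        self._length = length
        self._pos = 0  # position relative to the start of the view

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._length + offset
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        if pos < 0:
            raise ValueError("negative seek position: %r" % (pos,))
        self._pos = pos
        return pos

    def tell(self):
        return self._pos

    def readinto(self, buf):
        # Never read past the end of this member
        size = min(len(buf), max(0, self._length - self._pos))
        if size == 0:
            return 0
        # Reposition every time, since other views share the same file
        self._f.seek(self._start + self._pos)
        data = self._f.read(size)
        buf[:len(data)] = data
        self._pos += len(data)
        return len(data)


def build_compound(parts):
    """Concatenate `parts` behind a hypothetical header (assumed layout)."""
    offset = 4 + 8 * len(parts)  # header size: count + (offset, size) pairs
    header = struct.pack("<I", len(parts))
    for part in parts:
        header += struct.pack("<II", offset, len(part))
        offset += len(part)
    return header + b"".join(parts)


def open_views(f):
    """Parse the hypothetical header and return one FileView per member."""
    f.seek(0)
    (count,) = struct.unpack("<I", f.read(4))
    entries = [struct.unpack("<II", f.read(8)) for _ in range(count)]
    return [FileView(f, offset, size) for offset, size in entries]


compound = io.BytesIO(build_compound([b"first member", b"\x00\x01\x02", b"third"]))
views = open_views(compound)

# Interleaved reads: each view keeps its own position
head = views[0].read(5)
middle = views[1].read()
rest = views[0].read()
```

Here head, middle, and rest come out as b"first", b"\x00\x01\x02", and b" member", so reading one view doesn't disturb another. I haven't tried it on large files, and it isn't thread-safe because all views share one underlying file position.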
Also, I was wondering: if these are actually multiple files, why not use some kind of archive file format and its library to read them?
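To show what I mean, here is a small sketch using an uncompressed tar archive, assuming you are free to change the container format (which may not be the case). The standard tarfile module's extractfile returns a file-like object for a single member that supports read, seek, and tell. The member names and contents below are made up for the demonstration.

```python
import io
import tarfile

# Build a small uncompressed tar archive in memory (stand-in for a real file)
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w") as tar:
    for name, payload in [("a.bin", b"alpha data"), ("b.bin", b"\x00\x01\x02\x03")]:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Reopen it for reading and get a file-like view of one member
archive.seek(0)
tar = tarfile.open(fileobj=archive, mode="r:")
member = tar.extractfile("a.bin")
member.seek(6)
tail = member.read()       # the bytes after the first six
position = member.tell()   # position is relative to the member, not the archive

other = tar.extractfile("b.bin").read()
```

The reading functions can then take either a real file object or the object returned by extractfile without caring which one they got.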
Hope it helps.