
Importing large IDL files into Python with SciPy


I currently use scipy.io.readsav() to import IDL .sav files into Python, which works well, e.g.:

data = scipy.io.readsav('data.sav', python_dict=True, verbose=True)

However, if the .sav file is large (say > 1 GB), I get a MemoryError when trying to import into Python.

Usually, iterating through the data rather than loading it all in at once would of course solve this (if it were a .txt or .csv file), but I don't see how I can do that with a .sav file, since the only way I know of to import one is readsav.
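
For a plain .txt or .csv, for example, I would normally just do something like the following (the filename and process() are placeholders):

import csv

with open("data.csv", newline="") as f:
    for row in csv.reader(f):  # only one row is held in memory at a time
        process(row)  # whatever work is needed per row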

Any ideas how I can avoid this memory error?


Solution

  • You expressed interest in iterating over a .sav file. One (not too onerous) way to do this would be to write a lightweight wrapper class or function to use instead of SciPy's readsav(), using the slightly lower-level functions in the scipy.io.idl module, such as _read_record().

    Using that function, one could do something like the following:

    from scipy.io import idl
    
    def sav_iterator(file_path):
        """Yield one record at a time from an uncompressed IDL .sav file."""
        with open(file_path, "rb") as fp:  # open the file in binary mode
            signature = fp.read(2)  # file signature, should be b'SR'
            if signature != b'SR':
                raise ValueError("not a valid IDL .sav file")
            recfmt = fp.read(2)  # record format, b'\x00\x04' means uncompressed
            if recfmt != b'\x00\x04':
                raise ValueError("compressed .sav files are not handled here")
            while True:
                # _read_record() parses one record into a dict and advances
                # the file pointer to the start of the next record
                record_dict = idl._read_record(fp)
                yield record_dict
                if record_dict["rectype"] == "END_MARKER":
                    break  # last record reached, stop iterating
    
    for record in sav_iterator("my_data.sav"):
        do_something_with(record)  # placeholder for per-record processing
    

    With this method, only one record's worth of data ever needs to be held in memory at a time.
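
    Building on that, it is then easy to keep only the variables you actually need. The sketch below assumes the record dicts returned by _read_record() expose "varname" and "data" keys for records of type "VARIABLE" (this matches SciPy's internal record layout, but it is an implementation detail that may change between versions); the names in wanted are made up, and pointer-heavy variables may still need the extra heap handling that readsav() normally performs.

    wanted = {"TEMPERATURE", "PRESSURE"}  # hypothetical variable names (IDL stores them uppercase)
    extracted = {}
    for record in sav_iterator("my_data.sav"):
        # only "VARIABLE" records carry named data; skip timestamps, notices, etc.
        if record.get("rectype") == "VARIABLE" and record.get("varname") in wanted:
            extracted[record["varname"]] = record["data"]
    # 'extracted' now maps each requested variable name to its value, without the
    # full file contents ever having been held in memory at once.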