
Importing large IDL files into Python with SciPy


I currently use scipy.io.readsav() to import IDL .sav files into Python, which works well, e.g.:

data = scipy.io.readsav('data.sav', python_dict=True, verbose=True)

However, if the .sav file is large (say > 1 GB), I get a MemoryError when trying to import into Python.

Usually, iterating through the data rather than loading it all in at once would of course solve this (if it were a .txt or .csv file), but I don't see how I can do that with a .sav file, since the only way I know of to import one is readsav.
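
For a plain .txt or .csv, for example, I would normally just do something like the following (the filename and process() are placeholders):

import csv

with open("data.csv", newline="") as f:
    for row in csv.reader(f):  # only one row is held in memory at a time
        process(row)  # whatever work is needed per row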

Any ideas how I can avoid this memory error?


Solution

  • You expressed interest in iterating over a .sav file. One (not too onerous) way to do this would be to write a lightweight wrapper class or function to use instead of SciPy's readsav(), using the slightly lower-level functions in the scipy.io.idl module, such as _read_record().

    Using that function, one could do something like the following:

    from scipy.io import idl
    
    def sav_iterator(file_path):
        """Yield one record at a time from an uncompressed IDL .sav file."""
        with open(file_path, "rb") as fp:  # open the file in binary mode
            signature = fp.read(2)  # file signature, should be b'SR'
            if signature != b'SR':
                raise ValueError("not a valid IDL .sav file")
            recfmt = fp.read(2)  # record format, b'\x00\x04' means uncompressed
            if recfmt != b'\x00\x04':
                raise ValueError("compressed .sav files are not handled here")
            while True:
                # _read_record() parses one record into a dict and advances
                # the file pointer to the start of the next record
                record_dict = idl._read_record(fp)
                yield record_dict
                if record_dict["rectype"] == "END_MARKER":
                    break  # last record reached, stop iterating
    
    for record in sav_iterator("my_data.sav"):
        do_something_with(record)  # placeholder for per-record processing
    

    With this method, only one record's worth of data ever needs to be held in memory at a time.
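
    Building on that, it is then easy to keep only the variables you actually need. The sketch below assumes the record dicts returned by _read_record() expose "varname" and "data" keys for records of type "VARIABLE" (this matches SciPy's internal record layout, but it is an implementation detail that may change between versions); the names in wanted are made up, and pointer-heavy variables may still need the extra heap handling that readsav() normally performs.

    wanted = {"TEMPERATURE", "PRESSURE"}  # hypothetical variable names (IDL stores them uppercase)
    extracted = {}
    for record in sav_iterator("my_data.sav"):
        # only "VARIABLE" records carry named data; skip timestamps, notices, etc.
        if record.get("rectype") == "VARIABLE" and record.get("varname") in wanted:
            extracted[record["varname"]] = record["data"]
    # 'extracted' now maps each requested variable name to its value, without the
    # full file contents ever having been held in memory at once.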