matlab, archive, tar, disk-access

Matlab: direct/efficient untar to memory to avoid slow disk interactions


Given a .tar archive, Matlab allows one to extract the contained files to disk via the UNTAR command. One can then manipulate the extracted files in the ordinary way.
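For reference, a minimal example of this standard disk-based workflow (the archive and folder names below are just placeholders):

    % Extract everything in the archive to a folder on disk;
    % untar returns the paths of the extracted files.
    filenames = untar('data.tar', 'extracted');

    % The extracted files are then read back from disk in the usual way.
    txt = fileread(filenames{1});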

Issue: When several files are stored in a tarball, they sit contiguously on disk and, in principle, can be read sequentially. Once the files are extracted, this contiguity no longer holds and file access can become random, hence slow and inefficient.

This is especially critical when the files are numerous (thousands) and small.

My question: is there any way to access the archived files without the preliminary extraction (in a sort of HDF5 fashion)?

In other words, would it be possible to cache the .tar so as to access the contained files from memory rather than from disk?


(In general, direct .tar manipulation is possible, e.g. with tar-cs in C#, or in Python.)


Solution

  • After some time I finally worked out a solution which gave me unbelievable speedups (like 10x or so).

    In a word: ramdisk (tested on Linux: Ubuntu & CentOS).


    Recap:

    Since the problem has some generality, let me state it again in a more complete fashion.

    Say that I have many small files stored on disk (text, pictures; on the order of millions) which I want to manipulate (e.g. via Matlab).

    Working on such files (i.e. loading them or transmitting them over the network) while they sit individually on disk is tremendously slow, since the disk access is mostly random.

    Hence, tarballing the files into archives (e.g. of fixed size) looked to me like a good way to keep the disk access sequential.
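
    As a rough sketch of that idea (the source folder, file pattern and batch size below are just assumptions), the small files can be packed into fixed-size tarballs with Matlab's TAR:

        % Pack the small files into tarballs of a fixed batch size, so that
        % each archive can later be read back with sequential disk access.
        srcdir    = 'mydata';        % hypothetical folder holding the small files
        files     = dir(fullfile(srcdir, '*.txt'));
        batchsize = 10000;           % arbitrary example size

        for k = 1:batchsize:numel(files)
            batch   = files(k : min(k+batchsize-1, numel(files)));
            names   = fullfile(srcdir, {batch.name});
            tarname = sprintf('archive_%06d.tar', ceil(k/batchsize));
            tar(tarname, names);
        end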

    Problem:

    If the manipulation of the .tar requires a preliminary extraction to disk (as happens with Matlab's UNTAR), the speedup given by sequential disk access is mostly lost.

    Workaround:

    The tarball (provided it is reasonably small) can be extracted to memory and processed from there. As I stated in the question, though, in-memory .tar manipulation is not possible in Matlab.

    What can be done (equivalently) is untarring to a ramdisk.

    On Linux, e.g. Ubuntu, a default ramdisk is mounted at /run/shm (tmpfs). Files can be untarred there via Matlab, which then gives extremely fast access.

    In other words, a possible workcycle is:

    1. untar to /run/shm/mytemp
    2. manipulate in memory
    3. tar the output back to disk, if needed

    This allowed me to cut my processing time from 8 hours to 40 minutes, with the CPUs fully loaded.
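
    A minimal Matlab sketch of this workcycle; the archive name, ramdisk path and per-file processing below are placeholders:

        % 1. Untar to the ramdisk (tmpfs) instead of to a physical disk.
        ramdir = '/run/shm/mytemp';
        if ~exist(ramdir, 'dir'), mkdir(ramdir); end
        filenames = untar('archive_000001.tar', ramdir);

        % 2. Manipulate the files: reads now come from RAM, not from disk.
        results = cell(size(filenames));
        for k = 1:numel(filenames)
            results{k} = fileread(filenames{k});   % stand-in for the real processing
        end

        % 3. Possibly tar the output back to disk, then clean up the ramdisk.
        tar('output.tar', filenames);
        rmdir(ramdir, 's');

    Removing the temporary folder afterwards matters: anything left on the tmpfs keeps occupying RAM.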