[SOLVED] Efficient way to build a data set from fits image

I have a set of FITS images: about 32000 images with resolution (256, 256). The dataset I have to build is matrix-like, so the output shape is (32000, 256*256).

The simple solution is a for loop, something like:

import pyfits

# file_names is a list of paths
samples = []
for file_name in file_names:
    hdu = pyfits.open(file_name)
    # flatten() copies the (256, 256) image into a 1-D array of 65536 values
    samples.append(hdu[0].data.flatten())
    hdu.close()
# then I can use numpy.concatenate to get a numpy ndarray

This solution is very, very slow. What is the best way to build such a large data set?

Solution

This isn't really intended to be the main answer, but I felt it was too long for a comment and is relevant.

I believe there are a few things you can do without adjusting your code.

Python is a language specification, and it is implemented in different ways. The reference implementation is CPython, which is what you download from the website. However, there are other implementations (see here).
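If you are not sure which implementation you are currently running, the standard library can tell you. This is only a small sketch; the helper name `describe_interpreter` is made up for illustration.

```python
import platform
import sys


def describe_interpreter():
    """Return a short string naming the running Python implementation."""
    # platform.python_implementation() returns a name such as
    # 'CPython' or 'PyPy', depending on the interpreter in use.
    name = platform.python_implementation()
    version = ".".join(str(part) for part in sys.version_info[:3])
    return f"{name} {version}"


print(describe_interpreter())
```

Running the same script under each interpreter you are comparing makes it easy to confirm which one actually executed your timing test.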

Long story short, try PyPy, as it often runs significantly faster with "memory-hungry Python" such as yours. Here is a very nice reddit post about the advantages of each, but basically: use PyPy, and optimize your code. Additionally, I have never used NumPy, but this post suggests you might be able to keep NumPy and still use PyPy.
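On the "optimize your code" side, one thing that might be worth trying is to allocate the output array once and fill it row by row, instead of collecting a list and concatenating at the end. The sketch below is an assumption-laden illustration, not a measured fix: `build_dataset` and `read_primary_image` are hypothetical names, the loader reuses the `pyfits.open` / `hdu[0].data` calls from the question, and `float32` is an assumed dtype. Note that 32000 rows of 256*256 values is about 2.1 billion elements, so roughly 8.4 GB as `float32`. If reading the files dominates the run time, this will not help much.

```python
import numpy as np


def read_primary_image(file_name):
    """Hypothetical loader: return the primary HDU data of one FITS file."""
    import pyfits  # imported lazily; assumes pyfits is installed, as in the question

    hdu = pyfits.open(file_name)
    try:
        # Copy the pixels so the array stays valid after the file is closed.
        return np.array(hdu[0].data)
    finally:
        hdu.close()


def build_dataset(file_names, load_image=read_primary_image,
                  shape=(256, 256), dtype=np.float32):
    """Fill a preallocated (n_files, n_pixels) array one row per image."""
    n_pixels = shape[0] * shape[1]
    # One allocation up front; no intermediate list and no final concatenate.
    samples = np.empty((len(file_names), n_pixels), dtype=dtype)
    for row, file_name in enumerate(file_names):
        image = np.asarray(load_image(file_name))
        # reshape raises ValueError if an image does not have n_pixels values.
        samples[row] = image.reshape(n_pixels)
    return samples
```

The `load_image` parameter exists only so the loop can be tried with a stand-in loader (for example one that returns random arrays) to see how much of the time is spent outside file I/O.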

(Normally, I would also suggest Cython. I have not used it with NumPy myself, but Cython does support NumPy arrays, for example through typed memoryviews, so it is worth looking into.) Good luck!