I'm training my AI model with a huge data set, so it's impossible to preload all the data into memory at the beginning. I'm currently using psycopg2 to load data from a PostgreSQL DB during training.
I need to convert the sample data into numpy.ndarray, but psycopg2 returns data as a tuple of tuples, where each inner tuple carries one row of data and the outer tuple carries them all.
This piece of code (loading data) is the hottest spot of the entire training process, so I want to make it as fast as possible.
My current code looks like this:
rows = cursor.fetchall()  # rows is a tuple of tuples
features = [np.array(r, dtype=np.float32) for r in rows]
features = np.stack(features, axis=0)  # final output shape is (128 rows, 78 columns)
I'm wondering whether there is a faster way to convert a tuple of tuples into a 2D numpy array.
Thanks!
Postgres is performant, and generally very nice. But we’re not using a binary NDArray format during the data transfer, so as you observe, there’s room to improve.
You didn’t indicate which line took the bulk of the time. Finessing the SELECT query might let us avoid that final reshape.
I assume you load and retrain multiple times during the development cycle, so making repeated reloads of the same old data go faster would be attractive. Consider using binary Parquet or .h5 temp files. (Big ones, holding a great many 128 x 78 samples.) Then the format on disk matches the format in memory, and we can mmap() the data to access it, without translation or other munging.
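As a minimal sketch of that idea, here is a variant using a plain .npy cache plus numpy's built-in memory mapping instead of Parquet/.h5; the cache_path is hypothetical, but the principle (on-disk format matches in-memory format, accessed via mmap) is the same:

import os
import numpy as np

cache_path = "features_cache.npy"  # hypothetical location for the binary cache

if not os.path.exists(cache_path):
    rows = cursor.fetchall()
    np.save(cache_path, np.asarray(rows, dtype=np.float32))

# mmap_mode="r" maps the file into memory; pages are faulted in on access,
# so repeated reloads skip the tuple-of-tuples translation entirely.
features = np.load(cache_path, mmap_mode="r")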
If you keep the current code, at least push the “looping” down into the RDBMS by issuing a single giant SELECT. This is a generic rule that I have seen pay off many times: delete the top-level “for” loop that issues many small queries, and instead reveal the whole game plan to the DB backend and let it stream data continuously. That moves us from worrying about latency to worrying about throughput.
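For instance, a psycopg2 named (server-side) cursor streams the result of one big SELECT in batches. A sketch, assuming an open connection conn; the table, columns, and query are placeholders:

import numpy as np

# A named cursor is server-side: the backend holds the result of the
# single big SELECT and hands it over in batches on demand.
cur = conn.cursor(name="training_stream")
cur.execute("SELECT f1, f2, f3 FROM samples ORDER BY id")  # placeholder query

batch = cur.fetchmany(128)  # one training batch per round trip
while batch:
    features = np.asarray(batch, dtype=np.float32)  # shape (128, n_columns)
    # ... feed `features` to the training step here ...
    batch = cur.fetchmany(128)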