Tags: python, pandas, numpy, ragged, ragged-tensors

Pandas rows containing numpy ndarrays of various shapes


I'm creating a Pandas DataFrame in which each (index, column) location can hold a numpy ndarray of arbitrary shape, or even a simple number.

This works:

import numpy as np
import pandas as pd

# One cell holds a large 4-D array; every other cell holds a plain integer.
x = pd.DataFrame([[np.random.rand(100, 100, 20, 2), 3], [2, 2], [3, 3], [4, 4]],
                 index=['A1', 'B2', 'C3', 'D4'], columns=['data', 'data2'])
print(x)

but it takes 50 seconds to create on my computer! Why?

np.random.rand(100, 100, 20, 2) alone is super fast (< 1 second to create).

How can I speed up the creation of Pandas DataFrames containing ndarrays of various shapes?


Solution

  • The creation isn't actually the issue; the print statement is. 1000 loops of the creation take 2.8 seconds on my computer, but a single print takes about 26 seconds.

    Interestingly, print(x['data2']), print(x['data']['A1']) and print(x['data']['B2']) are all basically instantaneous. So it seems print has trouble figuring out how to display a frame whose items differ vastly in size. Perhaps a bug?
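
To check this on your own machine, here is a minimal sketch that times the construction and the string rendering separately. It also builds a lightweight "summary" view for display. The helper names (timed, summarize, build_frame, report) are made up for illustration and are not part of pandas. The summary view is only one possible workaround, not something pandas does for you, and the timings will vary with pandas version and hardware.

```python
import time

import numpy as np
import pandas as pd


def timed(fn):
    """Call a zero-argument function and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def summarize(value):
    """Replace an ndarray with a short shape/dtype label; pass other values through."""
    if isinstance(value, np.ndarray):
        return f"ndarray{value.shape} {value.dtype}"
    return value


def build_frame(shape=(100, 100, 20, 2)):
    """Build the frame from the question: one big array cell, the rest integers."""
    return pd.DataFrame(
        [[np.random.rand(*shape), 3], [2, 2], [3, 3], [4, 4]],
        index=["A1", "B2", "C3", "D4"],
        columns=["data", "data2"],
    )


def report(shape=(100, 100, 20, 2)):
    """Print how long creation, full repr, and the summary view each take."""
    x, t_create = timed(lambda: build_frame(shape))
    _, t_repr = timed(lambda: repr(x))  # repr(x) is what print(x) has to compute
    # Series.map is applied column by column, so the big array itself is never formatted.
    summary, t_summary = timed(lambda: x.apply(lambda col: col.map(summarize)))
    print(f"create:  {t_create:.3f} s")
    print(f"repr:    {t_repr:.3f} s")
    print(f"summary: {t_summary:.3f} s")
    print(summary)
```

Calling report() prints the three timings followed by the summary frame. If the repr time dominates, as the answer above reports, printing the summary view (or individual cells such as x['data']['A1']) avoids rendering the whole mixed-content frame.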