arrays, pandas, numpy, python-3.10, vaex

Efficiently convert numpy matrix to Vaex DataFrame


I'm trying to convert a wide 2D NumPy array (100K+ columns) into a Vaex DataFrame. Reading through the documentation, I see two relevant functions:

from_items

from_arrays

but both give me a single column x in which each row is a NumPy array. I expected Vaex to treat each column of the NumPy array as its own separate column in the Vaex DataFrame.

vaex.from_arrays(x=2d_numpy_matrix) gives me:

x
---
0 np.array(1,2,3)
1 np.array(4,5,6)

when I wanted:

0 | 1 | 2 (Column header)
---
1 | 2 | 3
4 | 5 | 6

My workaround is vaex.from_pandas(pd.DataFrame(2d_numpy_matrix)), but this is embarrassingly slow. Is there a way to do this that uses less CPU time?


Solution

  • Build a dictionary that maps each column name to a 1D array, then unpack it into vaex.from_arrays as keyword arguments:

    import numpy as np
    import vaex
    
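    # Column names are strings; '0', '1', '2' match the headers asked for in the question.
    headers = ['0', '1', '2']
    # Each row of `data` holds one column of the DataFrame. For a matrix whose
    # rows are records, iterate over `matrix.T` instead.
    data = np.array([[1, 4], [2, 5], [3, 6]])
    
    df = vaex.from_arrays(**{header: column for header, column in zip(headers, data)})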
    

    This yields:

    >>> df
    #    0    1    2
    0    1    2    3
    1    4    5    6
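
The example above starts from data that is already laid out one column per row. For a matrix in the usual layout (rows are records, as in the question), the same idea can be sketched as follows. The helper names `columns_from_matrix` and `matrix_to_vaex` are made up for illustration and are not part of Vaex; the only Vaex call used is `vaex.from_arrays`. The sketch assumes that string names such as '0' are acceptable column names, as in the answer above.

```python
import numpy as np


def columns_from_matrix(matrix):
    """Map each column of a 2D array to a string name: '0', '1', ..."""
    if matrix.ndim != 2:
        raise ValueError("expected a 2D array")
    # Iterating over the transpose yields one 1D array per original column.
    # These are NumPy views, so building the dictionary copies no data.
    return {str(i): column for i, column in enumerate(matrix.T)}


def matrix_to_vaex(matrix):
    # Hypothetical wrapper: imported lazily so the helper above can be used
    # and tested without Vaex installed.
    import vaex
    return vaex.from_arrays(**columns_from_matrix(matrix))


matrix = np.array([[1, 2, 3], [4, 5, 6]])
columns = columns_from_matrix(matrix)
print({name: column.tolist() for name, column in columns.items()})
# {'0': [1, 4], '1': [2, 5], '2': [3, 6]}
```

Calling `matrix_to_vaex(matrix)` should then give the DataFrame shown above. Building the dictionary avoids the intermediate pandas DataFrame, but how much time this saves for 100K+ columns, and whether Vaex copies the strided views internally, is not measured here.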