I have a list of indices and for each index I have two 2D numpy arrays and a few scalar values.
One record might look like: index (long)
, data (np.ndarray)
, mask (np.ndarray)
, name (str)
.
I would like to store this information in a pyarrow.Table
and then append to delta using deltalake.write_deltalake
(delta table is required).
I don't want to use tolist()
method on the numpy arrays, because it is not zero-copy and requires reallocation which is not ideal when I have ~10k arrays with shape around (150,150)
. At the same time, I'd prefer to have 2D arrays as entries compared to flattened arrays so that I don't need to worry about storing number of rows and columns for each array (I have a solution for the 1D case).
I tried to utilize the function below to get pa.ListArray
or pa.FixedSizeListArray
, but I cannot figure out how to pass a list of pyarrow
arrays to the constructor of pa.Table
.
# Minimal (not) working example
import numpy as np
import pyarrow as pa
def numpy_2d_array_to_arrow(arr: np.ndarray) -> pa.ListArray:
"""
Zero copy conversion from 2D numpy.ndarray to Apache Arrow List Array
https://issues.apache.org/jira/browse/ARROW-5645
"""
assert (
len(arr.shape) == 2
), 'numpy.ndarray to Arrow conversion error: Expected 2D array.'
# This also works...
# return pa.FixedSizeListArray.from_arrays(arr.ravel(order="C"), arr.shape[1])
offsets = np.arange(0, np.prod(arr.shape) + arr.shape[1], arr.shape[1])
return pa.ListArray.from_arrays(offsets, arr.ravel(order='C'))
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
pa_arr = numpy_2d_array_to_arrow(arr)
arrays = [pa_arr, pa_arr] # example (one column, two entries)
indices = [0, 1]
# This fails
table = pa.Table.from_arrays([indices, arrays], ['index', 'data'])
I have tried to call pa.table(...)
directly, but multi-dimensional numpy arrays are not currently supported out of the box (ArrowInvalid: ('Can only convert 1-dimensional array values', 'Conversion failed for column data with type object')
).
On the other hand, passing a list of list (list[list[value]]
) works, but I would have to reallocate the array for that.
You can convert the numpy 2d array to a FixedSizeListArray before adding it to the arrow table.
import numpy as np
import pyarrow as pa
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
fs_arr = pa.FixedSizeListArray.from_arrays(arr.reshape(-1), list_size=arr.shape[1])
table = pa.table({"col1": [1, 2, 3, 4], "col2": fs_arr})
col1 | col2 |
---|---|
1 | [1, 2, 3] |
2 | [4, 5, 6] |
3 | [7, 8, 9] |
4 | [10, 11, 12] |
For this you need to concatenate all the underlying values of the 2d matrices.
Then you need to build the "offsets" of the rows and create a pa.ListArray
from the offsets and the concatenated values.
Then you build the offsets of each matrix (within the list of rows) and create a pa.ListArray
from these offsets and the row pa.ListArray
.
import numpy as np
import pyarrow as pa
np_arrays = [
np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]),
np.array([[1, 2], [3, 4]]),
]
values = pa.concat_arrays(pa.array(np_array.reshape(-1)) for np_array in np_arrays)
row_sizes = [np.repeat(np_array.shape[1], np_array.shape[0]) for np_array in np_arrays]
row_offsets = np.insert(np.cumsum(np.concatenate(row_sizes)), 0, 0)
row_list_array = pa.ListArray.from_arrays(offsets=row_offsets, values=values)
matrix_sizes = [np_array.shape[0] for np_array in np_arrays]
matrix_offsets = np.insert(np.cumsum(matrix_sizes), 0, 0)
matrix_list_array = pa.ListArray.from_arrays(
offsets=matrix_offsets, values=row_list_array
)
table = pa.table({"col1": [1, 2], "col2": matrix_list_array})