pythondelta-lakepyarrowapache-arrow

How to store Numpy arrays or lists of lists as entries in a single column of a PyArrow Table?


I have a list of indices and for each index I have two 2D numpy arrays and a few scalar values.

One record might look like: index (long), data (np.ndarray), mask (np.ndarray), name (str).

I would like to store this information in a pyarrow.Table and then append to delta using deltalake.write_deltalake (delta table is required).

I don't want to use tolist() method on the numpy arrays, because it is not zero-copy and requires reallocation which is not ideal when I have ~10k arrays with shape around (150,150). At the same time, I'd prefer to have 2D arrays as entries compared to flattened arrays so that I don't need to worry about storing number of rows and columns for each array (I have a solution for the 1D case).

I tried to utilize the function below to get pa.ListArray or pa.FixedSizeListArray, but I cannot figure out how to pass a list of pyarrow arrays to the constructor of pa.Table.

# Minimal (not) working example

import numpy as np
import pyarrow as pa


def numpy_2d_array_to_arrow(arr: np.ndarray) -> pa.ListArray:
    """
    Zero copy conversion from 2D numpy.ndarray to Apache Arrow List Array
    https://issues.apache.org/jira/browse/ARROW-5645
    """
    assert (
        len(arr.shape) == 2
    ), 'numpy.ndarray to Arrow conversion error: Expected 2D array.'
    # This also works...
    # return pa.FixedSizeListArray.from_arrays(arr.ravel(order="C"), arr.shape[1])
    offsets = np.arange(0, np.prod(arr.shape) + arr.shape[1], arr.shape[1])
    return pa.ListArray.from_arrays(offsets, arr.ravel(order='C'))

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

pa_arr = numpy_2d_array_to_arrow(arr)
arrays = [pa_arr, pa_arr]  # example (one column, two entries)
indices = [0, 1]

# This fails
table = pa.Table.from_arrays([indices, arrays], ['index', 'data'])

I have tried to call pa.table(...) directly, but multi-dimensional numpy arrays are not currently supported out of the box (ArrowInvalid: ('Can only convert 1-dimensional array values', 'Conversion failed for column data with type object')).

On the other hand, passing a list of list (list[list[value]]) works, but I would have to reallocate the array for that.


Solution

  • 1d numpy array

    You can convert the numpy 2d array to a FixedSizeListArray before adding it to the arrow table.

    import numpy as np
    import pyarrow as pa
    
    arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
    
    fs_arr = pa.FixedSizeListArray.from_arrays(arr.reshape(-1), list_size=arr.shape[1])
    
    table = pa.table({"col1": [1, 2, 3, 4], "col2": fs_arr})
    
    col1 col2
    1 [1, 2, 3]
    2 [4, 5, 6]
    3 [7, 8, 9]
    4 [10, 11, 12]

    2d numpy array

    For this you need to concatenate all the underlying values of the 2d matrices.

    Then you need to build the "offsets" of the rows and create a pa.ListArray from the offsets and the concatenated values.

    Then you build the offsets of each matrix (within the list of rows) and create a pa.ListArray from these offsets and the row pa.ListArray.

    import numpy as np
    import pyarrow as pa
    
    np_arrays = [
        np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]),
        np.array([[1, 2], [3, 4]]),
    ]
    
    values = pa.concat_arrays(pa.array(np_array.reshape(-1)) for np_array in np_arrays)
    
    row_sizes = [np.repeat(np_array.shape[1], np_array.shape[0]) for np_array in np_arrays]
    row_offsets = np.insert(np.cumsum(np.concatenate(row_sizes)), 0, 0)
    row_list_array = pa.ListArray.from_arrays(offsets=row_offsets, values=values)
    
    matrix_sizes = [np_array.shape[0] for np_array in np_arrays]
    matrix_offsets = np.insert(np.cumsum(matrix_sizes), 0, 0)
    matrix_list_array = pa.ListArray.from_arrays(
        offsets=matrix_offsets, values=row_list_array
    )
    
    
    table = pa.table({"col1": [1, 2], "col2": matrix_list_array})