python · pandas · parquet

Loading a pandas DataFrame from Parquet: lists are deserialized as NumPy ndarrays


import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "b", "c"],
    "col2": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11, 12]],  # list column
})
df.to_parquet("./df_as_pq.parquet")
df = pd.read_parquet("./df_as_pq.parquet")
[type(val) for val in df["col2"].tolist()]

Output:

[<class 'numpy.ndarray'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>]

Is there any way to read the parquet file back and get the list values as plain Python lists (just as at creation)? Preferably using pandas, but I'm willing to try alternatives.

The problem I'm facing is that I have no prior knowledge of which columns hold lists, so I check the types much like in the code above. Assuming I don't want to add numpy as an explicit dependency for now, is there any way to check whether a variable is array-like without importing it and naming np.ndarray?
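For the array-like check specifically, one option that avoids importing numpy is pandas' own duck-typing helper, `pd.api.types.is_list_like` (a sketch; note it also reports tuples, sets and ndarrays as list-like, while strings and scalars are excluded):

```python
import pandas as pd

# pandas ships its own duck-typing helper, so numpy never needs to be
# imported directly; strings and plain scalars are not list-like.
for val in ([1, 2, 3], "abc", 42):
    print(repr(val), "->", pd.api.types.is_list_like(val))
# [1, 2, 3] -> True
# 'abc' -> False
# 42 -> False
```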


Solution

  • You can't change this behavior through the API, either when loading the parquet file into an Arrow table or when converting the Arrow table to pandas.

    But you can write your own function that looks at the schema of the Arrow table and converts every list field to a Python list:

    import pyarrow as pa
    import pyarrow.parquet as pq

    def load_as_list(file):
        # Read the file as an Arrow table so the schema is available.
        table = pq.read_table(file)
        df = table.to_pandas()
        # Any list-typed field comes back as a column of ndarrays;
        # convert those values to Python lists.
        for field in table.schema:
            if pa.types.is_list(field.type):
                df[field.name] = df[field.name].apply(list)
        return df

    load_as_list("./df_as_pq.parquet")