python · pandas · parquet

Loading a pandas DataFrame from Parquet: lists are deserialized as NumPy ndarrays


import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "b", "c"],
    "col2": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11, 12]],  # list column
})
df.to_parquet("./df_as_pq.parquet")
df = pd.read_parquet("./df_as_pq.parquet")
[type(val) for val in df["col2"].tolist()]

Output:

[<class 'numpy.ndarray'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>]

Is there any way to read the parquet file back and get the list values as plain Python lists (just as at creation)? Preferably using pandas, but I'm willing to try alternatives.

The problem I'm facing is that I have no prior knowledge of which columns hold lists, so I check the types much like in the code above. Assuming I don't want to add numpy as an explicit dependency for now, is there any way to check whether a variable is array-like without importing it and naming np.ndarray?
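For the array-like check specifically, one option that avoids importing numpy is pandas' own duck-typing helper, `pd.api.types.is_list_like` (a sketch; note it also reports tuples, sets and ndarrays as list-like, while strings and scalars are excluded):

```python
import pandas as pd

# pandas ships its own duck-typing helper, so numpy never needs to be
# imported directly; strings and plain scalars are not list-like.
for val in ([1, 2, 3], "abc", 42):
    print(repr(val), "->", pd.api.types.is_list_like(val))
# [1, 2, 3] -> True
# 'abc' -> False
# 42 -> False
```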


Solution

  • You can't change this behavior through the API, either when loading the parquet file into an Arrow table or when converting the Arrow table to pandas.

    But you can write your own function that looks at the schema of the Arrow table and converts every list field to a Python list:

    import pyarrow as pa
    import pyarrow.parquet as pq

    def load_as_list(file):
        # Read the file as an Arrow table so the schema is available.
        table = pq.read_table(file)
        df = table.to_pandas()
        # Any list-typed field comes back as a column of ndarrays;
        # convert those values to Python lists.
        for field in table.schema:
            if pa.types.is_list(field.type):
                df[field.name] = df[field.name].apply(list)
        return df

    load_as_list("./df_as_pq.parquet")