import pandas as pd
df = pd.DataFrame({
    "col1": ["a", "b", "c"],
    "col2": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11, 12]],
})
df.to_parquet("./df_as_pq.parquet")
df = pd.read_parquet("./df_as_pq.parquet")
[type(val) for val in df["col2"].tolist()]
Output:
[<class 'numpy.ndarray'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>]
Is there any way I can read the parquet file back and get the list values as plain Python lists (just like at creation)? Preferably using pandas, but I'm willing to try alternatives.
The problem I'm facing is that I have no prior knowledge of which columns hold lists, so I check the types much like I do in the code above. Assuming I don't currently want to add numpy as an explicit dependency, is there any way to check whether a variable is array-like without explicitly importing numpy and testing against np.ndarray?
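For illustration, this is roughly the kind of duck-typing check I have in mind, avoiding the numpy import entirely (the helper name is_array_like is just a placeholder):

def is_array_like(val):
    # Treat any non-string, non-mapping object that supports len() and
    # iteration as array-like; this matches both lists and numpy arrays.
    return (
        not isinstance(val, (str, bytes, dict))
        and hasattr(val, "__len__")
        and hasattr(val, "__iter__")
    )

[is_array_like(val) for val in df["col2"].tolist()]
# [True, True, True]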
You can't change this behavior through the API, either when loading the parquet file into an Arrow table or when converting the Arrow table to pandas. But you can write your own function that looks at the schema of the Arrow table and converts every list field to a Python list:
import pyarrow as pa
import pyarrow.parquet as pq

def load_as_list(file):
    table = pq.read_table(file)
    df = table.to_pandas()
    # The Arrow schema tells us which columns are list-typed; convert the
    # numpy arrays that to_pandas() produced for those columns into lists.
    for field in table.schema:
        if pa.types.is_list(field.type):
            df[field.name] = df[field.name].apply(list)
    return df

load_as_list("./df_as_pq.parquet")
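As a quick sanity check against the file written above, the values now come back as plain Python lists:

df = load_as_list("./df_as_pq.parquet")
[type(val) for val in df["col2"].tolist()]
# [<class 'list'>, <class 'list'>, <class 'list'>]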