I want to write data in which some columns are arrays of strings or arrays of structs (typically key-value pairs) to a Parquet file for use in AWS Athena.
After finding two Python libraries (Arrow and fastparquet) that support writing to Parquet files, I have struggled for a while to produce arrays of structs.
The top answer to a question about writing Parquet files lists these two libraries (and does mention their lack of support for nested data).
So is there a way to write nested data to Parquet files from Python?
I tried the following with Arrow in order to store the key/value pairs.
import pyarrow as pa
import pyarrow.parquet as pq

# One row per country; each row carries a list of {city, population} structs.
countries = []
populations = []
countries.append('Sweden')
populations.append([{'city': 'Stockholm', 'population': 1515017}, {'city': 'Gothenburg', 'population': 590580}])
countries.append('Norway')
populations.append([{'city': 'Oslo', 'population': 958378}, {'city': 'Bergen', 'population': 254235}])

# Struct type for a single city entry.
ty = pa.struct([pa.field('city', pa.string()),
                pa.field('population', pa.int32())])

fields = [
    pa.field('country', pa.string()),
    pa.field('populations', pa.list_(ty)),
]
sch1 = pa.schema(fields)

data = [
    pa.array(countries),
    pa.array(populations, type=pa.list_(ty)),
]
batch = pa.RecordBatch.from_arrays(data, ['country', 'populations'])
table = pa.Table.from_batches([batch], sch1)

writer = pq.ParquetWriter('cities.parquet', sch1)
writer.write_table(table)
writer.close()
When I ran the code, I got the following message:
Traceback (most recent call last):
File "stackoverflow.py", line 30, in <module>
writer.write_table(table)
File "/Users/moonhouse/anaconda2/envs/parquet/lib/python3.6/site-packages/pyarrow/parquet.py", line 327, in write_table
self.writer.write_table(table, row_group_size=row_group_size)
File "_parquet.pyx", line 955, in pyarrow._parquet.ParquetWriter.write_table
File "error.pxi", line 77, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children
An answer on a recent Arrow JIRA ticket featuring the same error message suggests that there is ongoing work on supporting structs, although it is unclear to me whether it covers writing them or just reading them.
When I tried to store data using fastparquet (here with a list of strings):
import pandas as pd
from fastparquet import write

# A single row whose 'cities' column holds a list of strings.
data = [{'cities': ['Stockholm', 'Copenhagen', 'Oslo', 'Helsinki']}]
df = pd.DataFrame(data)
write('test.parq', df, compression='SNAPPY')
no error message was given, but when I viewed the file in parquet-tools I noticed that the data was stored as Base64-encoded JSON:
cities = WyJTdG9ja2hvbG0iLCAiQ29wZW5oYWdlbiIsICJPc2xvIiwgIkhlbHNpbmtpIl0=
This is expected, I guess, given that fastparquet doesn't support nested object arrays.
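Decoding that value confirms it is just the JSON serialization of the list rather than a native Parquet nested type (a minimal check using only the string shown above):

import base64

encoded = 'WyJTdG9ja2hvbG0iLCAiQ29wZW5oYWdlbiIsICJPc2xvIiwgIkhlbHNpbmtpIl0='
# Decodes to: ["Stockholm", "Copenhagen", "Oslo", "Helsinki"]
print(base64.b64decode(encoded).decode('utf-8'))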
Pulling in pyarrow >= 0.17.0 should fix your error.
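Assuming pyarrow 0.17.0 or later is installed, the pyarrow code in the question should then succeed as written; a minimal sketch for reading the file back to verify the nested structure survived:

import pyarrow.parquet as pq

# Read back the file written above and inspect the nested column.
table2 = pq.read_table('cities.parquet')
print(table2.schema)
# First row's list of structs, e.g. [{'city': 'Stockholm', 'population': 1515017}, ...]
print(table2.to_pydict()['populations'][0])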