I use pyarrow to create and analyse Parquet tables with biological information, and I need to store some metadata, e.g. which sample the data comes from and how it was obtained and processed.
Parquet seems to support file-wide metadata, but I cannot find how to write it via pyarrow. The closest thing I could find is how to write row-group metadata, but that seems like overkill, since my metadata is the same for all row groups in the file.
Is there any way to write file-wide Parquet metadata with pyarrow?
This example shows how to create a Parquet file with file-level and column-level metadata using PyArrow.
Suppose you have the following CSV data:
movie,release_year
three idiots,2009
her,2013
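If you want to follow along end to end, you can create the sample file from Python first (the path movies.csv is just an assumption for this walkthrough):
with open('movies.csv', 'w') as f:
    f.write('movie,release_year\nthree idiots,2009\nher,2013\n')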
Read the CSV into a PyArrow table and define a custom schema with column- and file-level metadata:
import pyarrow.csv as pv
import pyarrow.parquet as pq
import pyarrow as pa
table = pv.read_csv('movies.csv')

# Metadata passed to pa.field is attached to that column; the metadata
# argument on pa.schema itself becomes the file-wide (schema-level) metadata.
my_schema = pa.schema([
    pa.field("movie", "string", nullable=False, metadata={"spanish": "pelicula"}),
    pa.field("release_year", "int64", nullable=True, metadata={"portuguese": "ano"})],
    metadata={"great_music": "reggaeton"})
Create a new table with my_schema and write it out as a Parquet file:
# cast attaches my_schema (and its metadata) to the existing columns
t2 = table.cast(my_schema)
pq.write_table(t2, 'movies.parquet')
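If you only need file-wide metadata and don't want to redefine the whole schema, a sketch using pyarrow's Table.replace_schema_metadata works too (the output path movies_file_meta.parquet is an arbitrary choice here):
import pyarrow.csv as pv
import pyarrow.parquet as pq

table = pv.read_csv('movies.csv')
# replace_schema_metadata discards any existing schema metadata, so merge it in
existing = table.schema.metadata or {}
merged = {**existing, b'great_music': b'reggaeton'}
pq.write_table(table.replace_schema_metadata(merged), 'movies_file_meta.parquet')
Note that replace_schema_metadata returns a new table; the original is left unchanged.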
Read the Parquet file and fetch the file metadata:
s = pq.read_table('movies.parquet').schema
s.metadata # => {b'great_music': b'reggaeton'}
s.metadata[b'great_music'] # => b'reggaeton'
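If you only want the metadata, you don't have to read the whole table; pyarrow.parquet also provides read_schema, which reads just the file footer:
pq.read_schema('movies.parquet').metadata # => {b'great_music': b'reggaeton'}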
Fetch the metadata associated with the release_year column:
s.field('release_year').metadata[b'portuguese'] # => b'ano'