python parquet pyarrow

How to write Parquet metadata with pyarrow?


I use pyarrow to create and analyse Parquet tables with biological information and I need to store some metadata, e.g. which sample the data comes from, how it was obtained and processed.

Parquet seems to support file-wide metadata, but I cannot find how to write it via pyarrow. The closest thing I could find is how to write row-group metadata, but this seems like overkill, since my metadata is the same for all row groups in the file.

Is there any way to write file-wide Parquet metadata with pyarrow?


Solution

  • This example shows how to create a Parquet file with file metadata and column metadata with PyArrow.

    Suppose you have the following CSV data:

    movie,release_year
    three idiots,2009
    her,2013
    

    Read the CSV into a PyArrow table and define a custom schema with column / file metadata:

    import pyarrow.csv as pv
    import pyarrow.parquet as pq
    import pyarrow as pa
    
    table = pv.read_csv('movies.csv')
    
    my_schema = pa.schema([
        pa.field("movie", "string", False, metadata={"spanish": "pelicula"}),
        pa.field("release_year", "int64", True, metadata={"portuguese": "ano"})],
        metadata={"great_music": "reggaeton"})
    

    Create a new table with my_schema and write it out as a Parquet file:

    t2 = table.cast(my_schema)
    
    pq.write_table(t2, 'movies.parquet')
    

    Read the Parquet file and fetch the file metadata:

    s = pq.read_table('movies.parquet').schema
    
    s.metadata # => {b'great_music': b'reggaeton'}
    s.metadata[b'great_music'] # => b'reggaeton'
    

    Fetch the metadata associated with the release_year column:

    s.field('release_year').metadata[b'portuguese'] # => b'ano'