
AwkwardArray: Possible to append an array to an existing Parquet file?


Is it possible with AwkwardArray (awkward0) to append to an existing Parquet file (one written by AwkwardArray)?

Normal Awkward Parquet storing

The following code creates a Parquet file containing a few Awkward arrays (e.g. audio data):

import numpy as np
import awkward as awk
import pyarrow.parquet as pq

# create Awkward Table from dict with numpy arrays
awk_array = awk.fromiter([{"ch0": np.array([0, 1, 2]), "ch1": np.array([3, 4, 5])},
                        {"ch0": np.array([6, 7]), "ch1": np.array([8, 9])}])
awk_array.tolist()
# [{'ch0': [0, 1, 2], 'ch1': [3, 4, 5]}, {'ch0': [6, 7], 'ch1': [8, 9]}]

# save in Parquet format
awk.toparquet("audio.parquet", awk_array)

# check if we can successfully load again; success
awk.fromparquet("audio.parquet")["ch0"].tolist()
# [[0, 1, 2], [6, 7]]

Appending Parquet (no Awkward)

The pyarrow documentation on Parquet files shows that you can write a Parquet file incrementally, one table at a time:

with pq.ParquetWriter('example3.parquet', table.schema) as writer:
    for i in range(3):
        writer.write_table(table)

Question

Is something like this possible with Awkward arrays?

akw_arrays = []
akw_arrays.append(awk.fromiter([{"ch0": np.array([0, 1, 2]), "ch1": np.array([3, 4, 5])}]))
akw_arrays.append(awk.fromiter([{"ch0": np.array([6, 7]), "ch1": np.array([8, 9])}]))

# Awkward table schema
with pq.ParquetWriter("audio_append.parquet", awk.table.schema) as writer:
    for i in range(len(akw_arrays)):
        writer.write_table(akw_arrays[i])

For example, through something like an awkward.table.schema or an awkward.ParquetWriter()?

In reality, I don't have both arrays at the same time. Therefore, concatenating before writing is not possible.

Or is the only possibility to make use of something like Apache Arrow, and write everything at once to disk at the end?


Solution

  • The answer is no, although there is no fundamental reason why not. As you have shown, a Parquet file can be written incrementally, and in fact Awkward does this when writing ChunkedArrays (arrow.py#L418-L440). A different interface, reusing most of the code you see there, could leave the Parquet file open for appending. That would be very useful for large datasets.

    Since Parquet files are navigated from a footer (by definition at the end of the file), I don't think Parquet files are appendable after they have been closed. (Something would have to invalidate or overwrite the original footer.) So an Awkward interface for iteratively writing Parquet files would have to open the file in a with block, to ensure that the footer is written exactly once.