Tags: python, pandas, hdf5, parquet, hdf

Converting HDF5 to Parquet without loading into memory


I have a large dataset (~600 GB) stored in HDF5 format. As this is too large to fit in memory, I would like to convert it to Parquet and use pySpark to perform some basic preprocessing (normalization, finding correlation matrices, etc.). However, I am unsure how to convert the entire dataset to Parquet without loading it into memory.

I looked at this gist: https://gist.github.com/jiffyclub/905bf5e8bf17ec59ab8f#file-hdf_to_parquet-py, but it appears that the entire dataset is being read into memory.

One thing I thought of was reading the HDF5 file in chunks and saving that incrementally into a Parquet file:

import pandas as pd

test_store = pd.HDFStore('/path/to/myHDFfile.h5')
nrows = test_store.get_storer('df').nrows
chunksize = N  # some chosen number of rows per chunk

for i in range(nrows // chunksize + 1):
    chunk = test_store.select('df', start=i * chunksize, stop=(i + 1) * chunksize)
    # convert_to_Parquet() ...

But I can't find any documentation on how to incrementally build up a Parquet file. Any links to further reading?


Solution

  • You can use pyarrow for this!

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    
    
    def convert_hdf5_to_parquet(h5_file, parquet_file, chunksize=100000):
        # Chunked reading requires the HDF5 dataset to have been written in
        # table format (e.g. with df.to_hdf(..., format='table')).
        stream = pd.read_hdf(h5_file, chunksize=chunksize)

        for i, chunk in enumerate(stream):
            print("Chunk {}".format(i))

            if i == 0:
                # Infer the schema and open the Parquet writer on the first chunk
                parquet_schema = pa.Table.from_pandas(df=chunk).schema
                parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')

            # Append the chunk to the Parquet file
            table = pa.Table.from_pandas(chunk, schema=parquet_schema)
            parquet_writer.write_table(table)

        parquet_writer.close()
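
    For completeness, a minimal usage sketch (the file paths and the Spark app name below are placeholders, not from the original post): convert the file once, then point pySpark at the resulting Parquet file for the kind of preprocessing mentioned in the question.

    convert_hdf5_to_parquet('/path/to/myHDFfile.h5', '/path/to/myParquetFile.parquet')

    # Read the converted data lazily with pySpark
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('hdf5-to-parquet-preprocessing').getOrCreate()
    sdf = spark.read.parquet('/path/to/myParquetFile.parquet')
    sdf.describe().show()  # e.g. quick summary statistics per column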