I have a series of HDF5 files holding large Pandas dataframes. A typical file has around 1,000,000 rows. I use complib='blosc', complevel=9 for compression. The original HDF5 files are each saved as one flat dataframe under a single key.
I then tried segmenting the dataframes logically into 30 smaller dataframes and saved them in the same HDF5 file under 30 different keys, with the same compression.
The shocking issue is that the files with 30 smaller dataframes are 40x larger than the flat file.
The flat HDF5 file is saved as follows:
dfx.to_hdf(file_name, key='opp',mode='a',complib='blosc',complevel=9, append=True)
The segmented HDF5 file is saved as follows:
for i in range(30):
    dfx_small[i].to_hdf(file_name, key='d'+str(i), mode='a', complib='blosc', complevel=9)
Am I doing something wrong or is this size increase expected?
I compared all the HDF5 files generated as (1) a flat dataframe vs (2) 30 chunked dataframes. It seems the largest files grow 8x to 10x in size when the dataframe is saved as 30 smaller dataframes, while the smaller files grow 100x to 1000x. I then experimented with saving the 30-chunk HDF5 files with and without compression. It seems that when multiple dataframes with unique keys are placed in the same HDF5 file, the compression is almost non-functional. I've tried all the compression options with similar results.
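For reference, the comparison boils down to something like the sketch below (simplified: the slices here are fixed-size and the data is synthetic filler, just to illustrate how the two files were written and measured):
import os
import numpy as np
import pandas as pd

# Synthetic filler data only to illustrate the comparison; the real data
# and its logical segmentation are different.
n_rows = 1_000_000
dfx = pd.DataFrame({'col_int': np.random.randint(0, 100, n_rows),
                    'col_float': np.random.random(n_rows)})

# 1) one flat key
dfx.to_hdf('flat.h5', key='opp', mode='w', complib='blosc', complevel=9)

# 2) 30 keys in the same file
step = len(dfx) // 30
for i in range(30):
    dfx.iloc[i*step:(i+1)*step].to_hdf('chunked.h5', key='d'+str(i), mode='a',
                                       complib='blosc', complevel=9)

for fname in ('flat.h5', 'chunked.h5'):
    print(fname, round(os.path.getsize(fname)/1e6, 1), 'MB')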
There is a bug with compression when saving HDF5 files with multiple datasets.
I opened Issue #45286.
I created some simple tests and discovered some interesting behavior. First, I created some test data and saved it with Pandas, both compressed with complib='blosc', complevel=9 (as in your example) and uncompressed. As expected, the uncompressed files were larger, but their increase from 1 DF to 30 DFs was much smaller (about 2.2x, versus roughly 11x with compression).

Pandas Results
# of DFs | Compression | Size (MB) |
---|---|---|
1 | Blosc-9 | 3.1 |
30 | Blosc-9 | 33.5 |
1 | No | 24.8 |
30 | No | 54.8 |
So, some of this behavior could be due to compression differences with multiple, smaller dataframes. (Likely the compression isn't as effective on smaller chunks.) Also, Pandas uses an "interesting" schema to store dataframes. If you inspect the files with HDFView, the data does not look like a simple table (as it would if you had used PyTables or h5py to save the data). There may be per-key overhead that is duplicated when you have multiple dataframes (but that is just an inference).
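If you don't have HDFView handy, you can get a similar view programmatically. Here is a small sketch (assuming h5py is installed) that walks a Pandas-written file and prints the storage allocated to each dataset; the file name matches one written in the example further below:
import h5py

# Walk the file and print the on-disk storage allocated to every dataset,
# to see where the space (and any duplicated overhead) goes.
def show_storage(name, obj):
    if isinstance(obj, h5py.Dataset):
        print(f'{name}: {obj.id.get_storage_size()/1e6:.2f} MB')

with h5py.File('pd_file_30_blosc.h5', 'r') as h5f:
    h5f.visititems(show_storage)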
Updated 2021-01-08:
You can also save to HDF5 format using PyTables or h5py. (Pandas is built on top of PyTables.) Out of curiosity, I extended my example above to write HDF5 files with each package. One file has a single dataset and the other has 30 datasets (to mimic the dataframes method). I also compared the effect of compression. The results provide some interesting insights.
PyTables Results
# of DSs | Compression | Size (MB) |
---|---|---|
1 | Blosc-9 | 0.14 |
30 | Blosc-9 | 0.40 |
1 | Zlib-9 | 0.08 |
30 | Zlib-9 | 0.17 |
1 | No | 11.9 |
30 | No | 13.5 |
Pandas vs PyTables Observations
The PyTables files are much smaller than the equivalent Pandas files, with and without compression, and with PyTables the Blosc-9 compression stays effective when the data is split into 30 datasets (0.40 MB vs 0.14 MB, roughly 3x, versus roughly 11x with Pandas).
h5py Results
# of DSs | Compression | Size (MB) |
---|---|---|
1 | Gzip-9 | 0.15 |
30 | Gzip-9 | 0.40 |
1 | No | 11.9 |
30 | No | 11.9 |
Pandas vs h5py Observations
Again, the h5py files are much smaller than the equivalent Pandas files, and Gzip-9 compression stays effective with 30 datasets (0.40 MB vs 0.15 MB).
PyTables vs h5py Observations
PyTables and h5py produce files of comparable size; the only notable difference is the uncompressed 30-dataset case, which is slightly larger with PyTables (13.5 MB vs 11.9 MB).
Create test data (as lists):
import string
col1 = list(string.ascii_letters)
col2 = [ x for x in range(1,53)]
col3 = [ float(x) for x in range(1,53)]
ncopies = 18_000
nslices = 30
fact = len(col1)*ncopies//nslices
col_c = []; col_int = []; col_float = []
for i in range(ncopies):
    col_c.extend(col1)
    col_int.extend(col2)
    col_float.extend(col3)
Write test data with Pandas:
import pandas as pd
dfx = pd.DataFrame({'col_c': col_c, 'col_int': col_int, 'col_float': col_float})
dfx.to_hdf('pd_file_1_blosc.h5',key='test_data',mode='a',complib='blosc',complevel=9)
dfx.to_hdf('pd_file_1_uc.h5',key='test_data',mode='a')
for i in range(nslices):
    dfx_small = dfx[i*fact:(i+1)*fact]
    dfx_small.to_hdf('pd_file_30_blosc.h5', key=f'd{i:02}', mode='a', complib='blosc', complevel=9)
    dfx_small.to_hdf('pd_file_30_uc.h5', key=f'd{i:02}', mode='a')
Create NumPy recarray for PyTables and h5py:
import numpy as np
arr_dt = np.dtype([ ('col_c', 'S1'), ('col_int', int), ('col_float', float) ])
recarr = np.empty(shape=(len(col_c),), dtype=arr_dt)
recarr['col_c'] = col_c
recarr['col_int'] = col_int
recarr['col_float'] = col_float
Write test data with PyTables:
import tables as tb
f_blosc = tb.Filters(complib ="blosc", complevel=9)
f_zlib = tb.Filters(complib ="zlib", complevel=9)
with tb.File('tb_file_1_blosc.h5', 'w') as h5f:
    h5f.create_table('/', 'test_data', obj=recarr, filters=f_blosc)
with tb.File('tb_file_1_zlib.h5', 'w') as h5f:
    h5f.create_table('/', 'test_data', obj=recarr, filters=f_zlib)
with tb.File('tb_file_1_uc.h5', 'w') as h5f:
    h5f.create_table('/', 'test_data', obj=recarr)
with tb.File('tb_file_30_blosc.h5', 'w') as h5f:
    for i in range(nslices):
        h5f.create_table('/', f'test_data_{i:02}', obj=recarr[i*fact:(i+1)*fact],
                         filters=f_blosc)
with tb.File('tb_file_30_zlib.h5', 'w') as h5f:
    for i in range(nslices):
        h5f.create_table('/', f'test_data_{i:02}', obj=recarr[i*fact:(i+1)*fact],
                         filters=f_zlib)
with tb.File('tb_file_30_uc.h5', 'w') as h5f:
    for i in range(nslices):
        h5f.create_table('/', f'test_data_{i:02}', obj=recarr[i*fact:(i+1)*fact])
Write test data with h5py:
import h5py
with h5py.File('h5py_file_1_gzip.h5', 'w') as h5f:
    h5f.create_dataset('test_data', data=recarr, compression="gzip", compression_opts=9)
with h5py.File('h5py_file_1_uc.h5', 'w') as h5f:
    h5f.create_dataset('test_data', data=recarr)
with h5py.File('h5py_file_30_gzip.h5', 'w') as h5f:
    for i in range(nslices):
        h5f.create_dataset(f'test_data_{i:02}', data=recarr[i*fact:(i+1)*fact],
                           compression="gzip", compression_opts=9)
with h5py.File('h5py_file_30_uc.h5', 'w') as h5f:
    for i in range(nslices):
        h5f.create_dataset(f'test_data_{i:02}', data=recarr[i*fact:(i+1)*fact])
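Finally, a quick loop like the one below (not part of the original tests) prints the on-disk size of every file written above, for easy comparison:
import glob
import os

# Report the on-disk size of every HDF5 file written by the scripts above.
for fname in sorted(glob.glob('*_file_*.h5')):
    print(f'{fname}: {os.path.getsize(fname)/1e6:.2f} MB')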