I'm writing a large number of small datasets to an HDF5 file, and the resulting file size is about 10x what I would expect from a naive tabulation of the data I'm putting in. My data is organized hierarchically as follows:
group 0
  -> subgroup 0
    -> dataset (dimensions: 100 x 4, datatype: float)
    -> dataset (dimensions: 100, datatype: float)
  -> subgroup 1
    -> dataset (dimensions: 100 x 4, datatype: float)
    -> dataset (dimensions: 100, datatype: float)
  ...
group 1
...
Each subgroup holds 400 + 100 = 500 float values, so it should take up 500 * 4 bytes = 2,000 bytes, ignoring overhead. I don't store any attributes alongside the data. Yet, in testing, I find that each subgroup takes up about 4 kB, or about twice what I would expect. I understand that there is some overhead, but where is it coming from, and how can I reduce it? Is it in representing the group structure?
More information: If I increase the dimensions of the two datasets in each subgroup to 1000 x 4 and 1000, then each subgroup takes up about 22,250 bytes, rather than the flat 20,000 bytes I expect. This implies an overhead of about 2.2 kB per subgroup, which is consistent with the results I was getting with the smaller dataset sizes. Is there any way to reduce this overhead?
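For concreteness, here is a minimal sketch of the writing pattern and measurement I'm describing, assuming h5py (the file name, group/dataset names, and counts are illustrative, not my actual code):

    import os
    import numpy as np
    import h5py

    n_groups, n_subgroups = 10, 100   # illustrative counts

    with h5py.File("original_layout.h5", "w") as f:
        for i in range(n_groups):
            grp = f.create_group("group_%d" % i)
            for j in range(n_subgroups):
                sub = grp.create_group("subgroup_%d" % j)
                # one 100 x 4 and one 100-element float32 dataset: 2,000 bytes of raw data
                sub.create_dataset("a", data=np.zeros((100, 4), dtype="float32"))
                sub.create_dataset("b", data=np.zeros(100, dtype="float32"))

    size = os.path.getsize("original_layout.h5")
    # file bytes per subgroup, to compare against the ~2 kB of raw data each one holds
    print(size / float(n_groups * n_subgroups), "bytes per subgroup")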
I'll answer my own question. The overhead involved just in representing the group structure is enough that it doesn't make sense to store small arrays, or to have many groups, each containing only a small amount of data. There does not seem to be any way to reduce the overhead per group, which I measured at about 2.2 kB.
I resolved this issue by combining the two datasets in each subgroup into a single (100 x 5) dataset. Then I eliminated the subgroups and combined all of the datasets in each group into one 3D dataset: where I previously had N subgroups in a group, I now have a single dataset with shape (N x 100 x 5). This saves the N * 2.2 kB of per-subgroup overhead that was previously present. Moreover, since HDF5's built-in compression is more effective on larger arrays, I now get a better than 1:1 overall packing ratio, whereas before, overhead took up half the space of the file and compression was completely ineffective.
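A minimal sketch of the consolidated layout, again assuming h5py (names, counts, and the choice of the gzip filter are illustrative):

    import numpy as np
    import h5py

    n_groups, n_subgroups = 10, 100   # illustrative counts

    with h5py.File("combined_layout.h5", "w") as f:
        for i in range(n_groups):
            # Pack each former subgroup's (100 x 4) and (100,) arrays side by side
            # into a (100 x 5) block, then stack the N blocks along a new leading axis.
            blocks = [np.column_stack([np.zeros((100, 4), dtype="float32"),
                                       np.zeros(100, dtype="float32")])
                      for _ in range(n_subgroups)]
            combined = np.stack(blocks)            # shape (N, 100, 5)
            f.create_dataset("group_%d" % i, data=combined,
                             compression="gzip")   # HDF5's built-in DEFLATE filter

The same values are stored, but each group now contains a single dataset instead of N subgroups with two datasets each, so the per-object metadata is paid only once per group.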
The lesson is to avoid complicated group structures in HDF5 files, and to try to combine as much data as possible into each dataset.