HDF5 Storage Overhead

I'm writing a large number of small datasets to an HDF5 file, and the resulting file size is about 10x what I would expect from a naive tabulation of the data I'm putting in. My data is organized hierarchically as follows:

group 0
    -> subgroup 0
        -> dataset (dimensions: 100 x 4, datatype: float)
        -> dataset (dimensions: 100, datatype: float)
    -> subgroup 1
        -> dataset (dimensions: 100 x 4, datatype: float)
        -> dataset (dimensions: 100, datatype: float)
    ...
group 1
...

Each subgroup should take up 500 * 4 bytes = 2,000 bytes, ignoring overhead. I don't store any attributes alongside the data. Yet, in testing, I find that each subgroup takes up about 4 kB, or about twice what I would expect. I understand that there is some overhead, but where is it coming from, and how can I reduce it? Is it in representing the group structure?

More information: If I increase the dimensions of the two datasets in each subgroup to 1000 x 4 and 1000, then each subgroup takes up about 22,250 bytes, rather than the flat 20,000 bytes I expect. This implies an overhead of about 2.2 kB per subgroup, and is consistent with the results I was getting with the smaller dataset sizes. Is there any way to reduce this overhead?
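
For reference, here is a minimal C sketch of the kind of layout I'm describing; the file name, group/subgroup counts, and the all-zero dummy data are placeholders, not my actual code:

    #include <stdio.h>
    #include "hdf5.h"

    int main(void)
    {
        hid_t file = H5Fcreate("overhead_test.h5", H5F_ACC_TRUNC,
                               H5P_DEFAULT, H5P_DEFAULT);
        static float a[100][4];   /* dummy 100 x 4 data, zero-initialized */
        static float b[100];      /* dummy 100-element data */
        hsize_t dims2[2] = {100, 4}, dims1[1] = {100};
        char gname[32], sname[32];

        for (int g = 0; g < 10; g++) {                 /* "group 0", "group 1", ... */
            snprintf(gname, sizeof gname, "group %d", g);
            hid_t grp = H5Gcreate2(file, gname, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

            for (int s = 0; s < 100; s++) {            /* "subgroup 0", "subgroup 1", ... */
                snprintf(sname, sizeof sname, "subgroup %d", s);
                hid_t sub = H5Gcreate2(grp, sname, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

                hid_t sp2 = H5Screate_simple(2, dims2, NULL);
                hid_t d2  = H5Dcreate2(sub, "data_2d", H5T_NATIVE_FLOAT, sp2,
                                       H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
                H5Dwrite(d2, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, a);

                hid_t sp1 = H5Screate_simple(1, dims1, NULL);
                hid_t d1  = H5Dcreate2(sub, "data_1d", H5T_NATIVE_FLOAT, sp1,
                                       H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
                H5Dwrite(d1, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, b);

                H5Dclose(d1); H5Sclose(sp1);
                H5Dclose(d2); H5Sclose(sp2);
                H5Gclose(sub);
            }
            H5Gclose(grp);
        }
        H5Fclose(file);
        return 0;
    }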

Yttriferous asked 15/1, 2013 at 6:24 Comment(3)
The HDF5 file format is extremely complex. It uses internal blocking to store data and metadata objects. The default block size for metadata is 2 KiB, and each (sub-)group has its own header space, which explains the observed difference of about 2,000 bytes. You might try experimenting with COMPACT storage - see (4.5) here for more information on storage strategies, and the sketch after these comments. – Carbohydrate
The numbers I gave above are with COMPACT set. The lesson from this is to avoid complicated group structures housing small amounts of data. After combining all of my datasets into a larger array and applying compression, I get a better than 1:1 packing ratio (the compression saves more space than the HDF5 overhead adds). – Yttriferous
@Thucydides411 your comment is the best answer! You should write it in an answer and accept it. – Rah
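
A minimal sketch of the COMPACT-storage suggestion from the first comment, using the HDF5 C API; the dataset name and shape are illustrative, and compact layout only applies to datasets whose raw data is under about 64 KiB:

    #include "hdf5.h"

    /* Create a small 100 x 4 float dataset with COMPACT layout, so its raw
     * data (1,600 bytes, well under the ~64 KiB compact limit) is stored
     * inside the dataset's object header instead of a separate block. */
    static hid_t create_compact_dataset(hid_t loc, const char *name,
                                        const float data[100][4])
    {
        hsize_t dims[2] = {100, 4};
        hid_t space = H5Screate_simple(2, dims, NULL);
        hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_layout(dcpl, H5D_COMPACT);     /* request compact storage */
        hid_t dset  = H5Dcreate2(loc, name, H5T_NATIVE_FLOAT, space,
                                 H5P_DEFAULT, dcpl, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
        H5Pclose(dcpl);
        H5Sclose(space);
        return dset;                          /* caller closes with H5Dclose() */
    }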

I'll answer my own question. The overhead involved just in representing the group structure is enough that it doesn't make sense to store small arrays, or to have many groups, each containing only a small amount of data. There does not seem to be any way to reduce the overhead per group, which I measured at about 2.2 kB.

I resolved this issue by combining the two datasets in each subgroup into a (100 x 5) dataset. Then, I eliminated the subgroups, and combined all of the datasets in each group into a 3D dataset. Thus, if I had N subgroups previously, I now have one dataset in each group, with shape (N x 100 x 5). I thus save the N * 2.2 kB overhead that was previously present. Moreover, since HDF5's built-in compression is more effective with larger arrays, I now get a better than 1:1 overall packing ratio, whereas before, overhead took up half the space of the file, and compression was completely ineffective.
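
A minimal C sketch of this repacking (not my exact code; the group name, N, chunk shape, and compression level below are illustrative choices):

    #include <stdlib.h>
    #include "hdf5.h"

    int main(void)
    {
        const hsize_t N = 1000;                /* number of former subgroups */
        hsize_t dims[3]  = {N, 100, 5};        /* one dataset replaces N subgroups */
        hsize_t chunk[3] = {64, 100, 5};       /* one chunk spans 64 former subgroups */
        float *data = calloc(N * 100 * 5, sizeof *data);   /* dummy data */

        hid_t file  = H5Fcreate("packed.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t grp   = H5Gcreate2(file, "group 0", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(3, dims, NULL);

        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 3, chunk);          /* compression requires chunked layout */
        H5Pset_deflate(dcpl, 4);               /* gzip, level 4 */

        hid_t dset = H5Dcreate2(grp, "data", H5T_NATIVE_FLOAT, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

        H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space);
        H5Gclose(grp);  H5Fclose(file);
        free(data);
        return 0;
    }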

The lesson is to avoid complicated group structures in HDF5 files, and to try to combine as much data as possible into each dataset.

Yttriferous answered 8/3, 2013 at 3:5 Comment(3)
Yes... and no. HDF5 was created by scientists to store massive datasets. I think obsessing over 2 kB is to miss the point. If you are so space-constrained, then this is probably the wrong library for you. It is ALWAYS worthwhile trying to make data as self-describing as possible, even if it takes a few kB to do so. You can build 'clever' data structures, just as you can write 'clever' code, but Moore's law is on the side of people who write maintainable code and self-describing data structures. – Anthologize
I think I went over this in my answer. 2 kB per dataset is certainly a concern if you're storing large numbers of small datasets. My answer, above, is to pack the data into larger datasets, if possible. I didn't propose a complicated structure: a 3D dataset, where each axis has a meaning, is pretty simple. – Yttriferous
Consider using JSON or binary JSON for such data; it is more portable and versatile than HDF5. If you need to store scientific data structures (such as ND-arrays or tables) in JSON, consider using JData annotations. – Paean

There has been some work in this direction, which became available in HDF5 1.10.5. There is now a function called H5Fset_dset_no_attrs_hint (with a property-list counterpart, H5Pset_dset_no_attrs_hint) which tells the library not to reserve extra space for attributes in each dataset's object header, reducing the per-dataset overhead.
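
A minimal sketch of how it might be used, assuming HDF5 1.10.5 or later; the file name is a placeholder:

    #include "hdf5.h"

    int main(void)
    {
        hid_t file = H5Fcreate("minimal_headers.h5", H5F_ACC_TRUNC,
                               H5P_DEFAULT, H5P_DEFAULT);
        /* Ask the library to write minimized object headers for datasets
         * subsequently created in this file (1 = minimize). */
        H5Fset_dset_no_attrs_hint(file, 1);
        /* ... create groups and datasets as usual; their headers are smaller ... */
        H5Fclose(file);
        return 0;
    }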

Jabez answered 22/7, 2023 at 8:27 Comment(0)
