I'm writing a large number of small datasets to an HDF5 file, and the resulting file size is about 10x what I would expect from a naive tabulation of the data I'm putting in. My data is organized hierarchically as follows:
group 0
    -> subgroup 0
        -> dataset (dimensions: 100 x 4, datatype: float)
        -> dataset (dimensions: 100, datatype: float)
    -> subgroup 1
        -> dataset (dimensions: 100 x 4, datatype: float)
        -> dataset (dimensions: 100, datatype: float)
    ...
group 1
    ...
Each subgroup holds 100 x 4 + 100 = 500 floats, so it should take up 500 * 4 bytes = 2000 bytes, ignoring overhead. I don't store any attributes alongside the data. Yet, in testing, I find that each subgroup takes up about 4 kB, or about twice what I would expect. I understand that there is some overhead, but where is it coming from, and how can I reduce it? Is it in representing the group structure?
More information: If I increase the dimensions of the two datasets in each subgroup to 1000 x 4 and 1000, then each subgroup takes up about 22,250 bytes, rather than the flat 20,000 bytes I expect. This implies an overhead of about 2.2 kB per subgroup, which is consistent with the roughly 2 kB of overhead I was seeing with the smaller dataset sizes. Is there any way to reduce this overhead?
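To make the arithmetic above explicit, here is a small sketch that recomputes the expected payload and the implied overhead from the figures quoted in this question. The function names are just illustrative, and it assumes 4-byte floats, as in the estimate above.

```python
BYTES_PER_FLOAT = 4  # assumption from the question: single-precision floats


def expected_bytes(rows: int) -> int:
    """Raw payload of one subgroup: a (rows x 4) dataset plus a (rows,) dataset."""
    return (rows * 4 + rows) * BYTES_PER_FLOAT


def overhead_bytes(rows: int, observed: int) -> int:
    """Observed size of one subgroup minus its raw payload."""
    return observed - expected_bytes(rows)


# Observed sizes are the approximate figures quoted in the question.
print(overhead_bytes(100, 4000))    # 2000
print(overhead_bytes(1000, 22250))  # 2250
```

The overhead stays near 2 kB while the payload grows tenfold, which suggests a roughly fixed cost per subgroup rather than one that scales with the data.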
COMPACT storage - see (4.5) here for more information on storage strategies. – Carbohydrate
COMPACT set. The lesson from this is to avoid complicated group structures housing small amounts of data. After combining all of my datasets into a larger array and applying compression, I get a better than 1:1 packing ratio (the compression saves more space than the HDF5 overhead adds). – Yttriferous
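For anyone who wants to try the "combine into a larger array and compress" approach from the last comment, here is a rough sketch using h5py and NumPy. This is not the commenter's actual code: the dataset names `a` and `b`, the group names, the subgroup count, and the random stand-in data are all made up for illustration, and the sizes you get will depend on your data and HDF5 version.

```python
import os
import tempfile

import h5py
import numpy as np


def write_nested(path, big, small):
    """One subgroup per entry with two small datasets each (layout from the question)."""
    with h5py.File(path, "w") as f:
        grp = f.create_group("group_0")
        for i in range(big.shape[0]):
            sub = grp.create_group(f"subgroup_{i}")
            sub.create_dataset("a", data=big[i])    # shape (rows, 4)
            sub.create_dataset("b", data=small[i])  # shape (rows,)


def write_combined(path, big, small):
    """Two large datasets with a leading 'subgroup' axis, gzip-compressed."""
    with h5py.File(path, "w") as f:
        f.create_dataset("a", data=big, compression="gzip")
        f.create_dataset("b", data=small, compression="gzip")


def compare_sizes(n_subgroups=200, rows=100):
    """Write the same data both ways and return (nested_bytes, combined_bytes)."""
    rng = np.random.default_rng(0)
    # Random floats are a stand-in; they compress poorly compared to most real data.
    big = rng.random((n_subgroups, rows, 4), dtype=np.float32)
    small = rng.random((n_subgroups, rows), dtype=np.float32)
    with tempfile.TemporaryDirectory() as tmp:
        nested = os.path.join(tmp, "nested.h5")
        combined = os.path.join(tmp, "combined.h5")
        write_nested(nested, big, small)
        write_combined(combined, big, small)
        return os.path.getsize(nested), os.path.getsize(combined)


if __name__ == "__main__":
    nested_bytes, combined_bytes = compare_sizes()
    print("nested:  ", nested_bytes, "bytes")
    print("combined:", combined_bytes, "bytes")
```

The trade-off is that the subgroup index becomes the first array axis (`f["a"][i]` instead of `f["group_0/subgroup_i/a"]`), so this only fits when all subgroups have the same shape.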