I would like to ask a question about compression performance as it relates to the chunk size of HDF5 files.
I have two HDF5 files with the following properties. Each contains a single dataset called "data".
File A's "data":
- Type: HDF5 Scalar Dataset
- No. of Dimensions: 2
- Dimension Size: 5094125 x 6
- Max. dimension size: Unlimited x Unlimited
- Data type: 64-bit floating point
- Chunking: 10000 x 6
- Compression: GZIP level = 7
File B's "data":
- Type: HDF5 Scalar Dataset
- No. of Dimensions: 2
- Dimension Size: 6720 x 1000
- Max. dimension size: Unlimited x Unlimited
- Data type: 64-bit floating point
- Chunking: 6000 x 1
- Compression: GZIP level = 7
File A's size: HDF5: 19 MB, CSV: 165 MB
File B's size: HDF5: 60 MB, CSV: 165 MB
Both show a large reduction compared with the CSV files. However, file A compresses to about 10% of the original CSV size, while file B only reaches about 30%.
I have tried different chunk sizes to make file B as small as possible, but 30% seems to be the best I can get. Why can file A achieve better compression while file B cannot?
If file B can also reach a similar compression ratio, what should the chunk size be?
Is there any rule for determining the optimal HDF5 chunk size for compression purposes?
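In case it helps, here is a minimal h5py sketch of how a dataset like file B's could be written with a given chunk shape and GZIP level. The file name, the random placeholder array, and the specific chunk shape are only illustrative, not my exact code; the chunks and compression_opts arguments are the knobs I have been varying.

    import numpy as np
    import h5py

    # Placeholder array with the same shape as file B's dataset.
    data = np.random.rand(6720, 1000)

    with h5py.File("file_b.h5", "w") as f:
        f.create_dataset(
            "data",
            data=data,
            maxshape=(None, None),   # unlimited in both dimensions
            chunks=(6000, 1),        # chunk shape under test
            compression="gzip",
            compression_opts=7,      # GZIP level
        )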
Thanks!