Compression performance related to chunk size in HDF5 files

I would like to ask a question about how compression performance relates to the chunk size of HDF5 files.

I have two HDF5 files on hand with the following properties. Each contains only one dataset, called "data".

File A's "data":

  1. Type: HDF5 Scalar Dataset
  2. No. of Dimensions: 2
  3. Dimension Size: 5094125 x 6
  4. Max. dimension size: Unlimited x Unlimited
  5. Data type: 64-bit floating point
  6. Chunking: 10000 x 6
  7. Compression: GZIP level = 7

File B's "data":

  1. Type: HDF5 Scalar Dataset
  2. No. of Dimensions: 2
  3. Dimension Size: 6720 x 1000
  4. Max. dimension size: Unlimited x Unlimited
  5. Data type: 64-bit floating point
  6. Chunking: 6000 x 1
  7. Compression: GZIP level = 7

File A's size: HDF5: 19 MB; CSV: 165 MB

File B's size: HDF5: 60 MB; CSV: 165 MB

Both show strong compression compared with the CSV files. However, file A compresses to about 10% of the original CSV size, while file B only reaches about 30%.

I have tried different chunk sizes to make file B as small as possible, but 30% seems to be the best compression ratio I can get. Why can file A achieve much better compression while file B cannot?

If file B can also achieve it, what should the chunk size be?

Is there any rule for determining the optimum HDF5 chunk size for compression purposes?
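
To make that concrete, the kind of experiment I mean looks roughly like this (a sketch assuming h5py; the random array is only a stand-in for file B's 6720 x 1000 data, which really comes from the CSV):

```python
import os
import h5py
import numpy as np

# Stand-in for file B's contents; the real values come from the 165 MB CSV.
data = np.random.rand(6720, 1000)

# Write the same array with several candidate chunk shapes and compare file sizes.
for chunks in [(6000, 1), (1000, 100), (100, 1000), (6720, 10)]:
    fname = "fileB_%dx%d.h5" % chunks
    with h5py.File(fname, "w") as f:
        f.create_dataset("data", data=data,
                         maxshape=(None, None),
                         chunks=chunks,
                         compression="gzip",
                         compression_opts=7)
    print(chunks, os.path.getsize(fname) / 1e6, "MB")
```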

Thanks!

Josephson asked 28/5, 2013 at 7:40
I guess the compression probably also depends on the similarity of the data inside a specific chunk, so it's hard to say why there is a difference. For more information on chunking and performance, refer to github.com/h5py/h5py/wiki/Guide-To-Compression, hdfgroup.org/HDF5/doc/UG/index.html, and pytables.org/moin/HowToUse#Presentations. – Dari
Thanks, I agree that it's hard to explain the difference, even though the compression ratio is indeed low. Furthermore, I wonder if it's related to the dimensions of the dataset; say, 100 x 100 and 1000 x 10 could have different compression performance even with the same data inside. – Josephson

Chunking doesn't really affect the compression ratio per se, except in the manner @Dari describes. What chunking does affect is I/O performance. When compression is applied to an HDF5 dataset, it is applied to whole chunks, individually. This means that when reading data from a single chunk in a dataset, the entire chunk must be decompressed, possibly involving a whole lot more I/O, depending on the size of the cache, the shape of the chunk, and so on.

What you should do is make sure that the chunk shape matches how you read/write your data. If you generally read a column at a time, make your chunks columns, for example. This is a good tutorial on chunking.
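
A minimal h5py sketch of the column-oriented case (the file name, array size, and chunk shape here are purely illustrative, not taken from the question):

```python
import h5py
import numpy as np

data = np.random.rand(1000000, 6)

# Tall, narrow chunks: each column's values live in chunks that hold only
# that column, so a column read decompresses only those chunks.
with h5py.File("column_chunked.h5", "w") as f:
    f.create_dataset("data", data=data,
                     chunks=(65536, 1),
                     compression="gzip",
                     compression_opts=7)

with h5py.File("column_chunked.h5", "r") as f:
    col = f["data"][:, 2]   # touches ~16 chunks, not the whole dataset
```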

Kira answered 31/5, 2013 at 13:08
I agree that chunking is related to I/O performance more than compression performance. On I/O performance I have a further question: if the dataset is fixed in dimension, like 10000 x 6, I think a chunk size of (1000, 6) is appropriate since I read it by row. However, if the dimensions are dynamic in nature, say the number of columns and rows grows over time, what should the chunk size be? – Josephson
Yes, that's a good size. Do they increase by a fixed amount each time? If so, I would suggest starting with that size. For example, if you always increase the dimensions by (500, 3), make your chunks (500, 3). It also depends on whether you do more reading than writing, or vice versa; if it's write-once, read-many, make your chunks conform to how you read the data. Of course, you may still want to take some measurements and refine your chunk size! – Kira
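
A minimal h5py sketch of that pattern (the (500, 3) block is just the example from the previous comment; the file name and loop are illustrative):

```python
import h5py
import numpy as np

with h5py.File("growing.h5", "w") as f:
    dset = f.create_dataset("data",
                            shape=(0, 3),
                            maxshape=(None, 3),  # rows unlimited; use (None, None) if columns grow too
                            chunks=(500, 3),     # chunk shape matches the append block
                            compression="gzip",
                            compression_opts=7)
    for _ in range(10):
        block = np.random.rand(500, 3)           # one append-sized block of new data
        dset.resize(dset.shape[0] + 500, axis=0)
        dset[-500:, :] = block
```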
Also, contrary to common belief, compression can actually improve read performance, but only provided that your chunk size corresponds to the way you read the data (see @Kira's comments). The reason reading compressed data can be faster than reading uncompressed data is that fast compression libraries, such as blosc in PyTables (multi-threaded) or lzf in h5py, are very fast and efficient. With huge datasets, I/O is the bottleneck rather than the CPU time spent on compression. See this article. – Dari
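
A rough sketch of how such a comparison could look (h5py assumed; lzf ships with h5py as a fast, lower-ratio filter, whereas blosc requires PyTables or an external filter plugin):

```python
import os
import time
import h5py
import numpy as np

data = np.random.rand(6720, 1000)

# Compare a high-ratio filter (gzip) with a fast, lower-ratio one (lzf).
for comp, opts in [("gzip", 7), ("lzf", None)]:
    with h5py.File("cmp.h5", "w") as f:
        f.create_dataset("data", data=data, chunks=(672, 1000),
                         compression=comp, compression_opts=opts)
    size_mb = os.path.getsize("cmp.h5") / 1e6
    t0 = time.time()
    with h5py.File("cmp.h5", "r") as f:
        _ = f["data"][:]                          # read everything back
    print(comp, "size:", round(size_mb, 1), "MB, read:", round(time.time() - t0, 3), "s")
```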
I see. I have created several HDF5 files with the same data and different chunk sizes, and compared their file sizes and read times. It is possible to achieve high compression together with good read performance. I plan to chunk the data by the estimated shape that is read each time. Thanks for all your help! – Josephson
