I am starting to work with the parquet file format. The official Apache site recommends large row groups of 512MB to 1GB (here). Several online sources (e.g. this one) suggest that the default row group size is 128MB.
I have a large number of parquet files which I will later process downstream with PySpark on AWS Glue. These files have very small row-groups. I cannot control the files I'm starting with, but I want to combine row-groups so as to have "more efficient" files prior to downstream processing. (Why? These files will be uploaded to S3 and processed with Spark; my understanding is that Spark reads one row-group at a time, so many small row-groups mean more IO operations, which is inefficient. If this assumption is invalid, please educate me.)
Let's consider just one of these files for this question. It's compressed (with snappy compression) and 85MB on disk. When I inspect its schema using the pqrs tool, it reports that the file has 55,733 records in 1,115 row groups, and each row group seems to be around 500 kB - specifically, something like this:
row group 7:
--------------------------------------------------------------------------------
total byte size: 424752
num of rows: 50
If I simply take (1,115 row-groups * 500 kB/row-group) that's roughly 550MB, whereas the file on disk is 85MB. Granted, some of the row-groups are smaller than 500kB, but I eyeballed around 100 of them (half at the top, half at the bottom) and they're in that general ballpark.
Sub-question 1: is the difference (roughly 550MB calculated vs 85MB actual) because the row-group size reported by pqrs actually represents the uncompressed size - perhaps the in-memory size of the row-group, which would presumably be larger than the compressed, serialized size on disk? In other words, I can't do a simplistic 1,115 * 500 but have to apply some sort of compression factor?
Sub-question 2: when I see a recommended row-group size of 128MB, what exactly does that refer to? The uncompressed in-memory size? The serialized, compressed size on disk? Something else? How does it relate to what's reported by pqrs?
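For reference, here is how I imagine cross-checking this with pyarrow's metadata API - a sketch only, which assumes that RowGroupMetaData.total_byte_size is the uncompressed size and that the per-column total_compressed_size values add up to roughly the on-disk size (those assumptions are exactly what I'm unsure about); "input.parquet" is a placeholder for one of my files:

import pyarrow.parquet as pq

md = pq.ParquetFile("input.parquet").metadata  # placeholder path
for i in range(md.num_row_groups):
    rg = md.row_group(i)
    # size of the row group as pqrs seems to report it (uncompressed?)
    uncompressed = rg.total_byte_size
    # sum of the compressed column-chunk sizes (on-disk size?)
    compressed = sum(rg.column(j).total_compressed_size for j in range(rg.num_columns))
    print(f"row group {i}: rows={rg.num_rows}, "
          f"total_byte_size={uncompressed}, compressed={compressed}")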
My (simplified) code to compact these row-groups is:
import pyarrow.dataset as ds
import pyarrow.parquet as pq

def compact_parquet_in_batches(infile, outfile, batchsize):
    parquet_file = pq.ParquetFile(infile)
    ds.write_dataset(
        parquet_file.iter_batches(batch_size=batchsize),
        outfile,
        schema=RSCHEMA,  # pyarrow schema for these files, defined elsewhere
        format='parquet'
    )
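(As an aside: if I'm reading the docs right, recent pyarrow versions let write_dataset control output row-group sizes directly via min_rows_per_group / max_rows_per_group. I'm not sure whether that is the better knob here, but this sketch - with arbitrary parameter choices - shows what I mean:)

import pyarrow.dataset as ds
import pyarrow.parquet as pq

def compact_parquet_with_group_hints(infile, outfile, batchsize):
    parquet_file = pq.ParquetFile(infile)
    ds.write_dataset(
        parquet_file.iter_batches(batch_size=batchsize),
        outfile,
        schema=parquet_file.schema_arrow,  # take the schema from the input file
        format='parquet',
        min_rows_per_group=batchsize,      # buffer batches until at least this many rows
        max_rows_per_group=batchsize,      # cap each written row group at this many rows
    )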
Main question: What should batchsize be?
iter_batches takes batch_size as a number of records rather than a byte size. I could calculate it from the total record count and a desired number of batches, but I'm unclear what quantity I should be targeting.
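For example, if the 128MB target refers to uncompressed bytes, I imagine deriving the batch size from the file's metadata with a back-of-the-envelope calculation like the sketch below (it assumes total_byte_size is the uncompressed size, which is part of what I'm asking; "input.parquet" is again a placeholder):

import pyarrow.parquet as pq

md = pq.ParquetFile("input.parquet").metadata  # placeholder path
total_uncompressed = sum(md.row_group(i).total_byte_size for i in range(md.num_row_groups))
bytes_per_row = total_uncompressed / md.num_rows
batchsize = int(128 * 1024 * 1024 / bytes_per_row)  # rows per ~128MB (uncompressed?) batch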
What I actually tried was a simpler calculation based on the on-disk size:
- required # batches = file size on disk in MB / 128 = 85/128 = 1 (rounded up)
- batch size = # records / required # batches = 55,733 / 1 = 55,733, rounded up to 60,000 (next 10k)
When I run my code with a batch size of 60k:
- I get two row groups (great, 1,115 is down to 2; but why not 1?)
- the reported byte size of the first row group is around 250MB. So not only did it create twice the number of row-groups I expected; instead of each being half the size I expected, each is roughly double.
row group 0:
--------------------------------------------------------------------------------
total byte size: 262055359
num of rows: 32768
I figure some of my assumptions - or my understanding of the parquet file format, the pqrs tool or the pyarrow library - are off. Can someone please demystify this for me?