I am starting to work with the parquet file format. The official Apache site recommends large row groups of 512MB to 1GB (here). Several online sources (e.g. this one) suggest that the default row group size is 128MB.
I have a large number of parquet files which I will later process downstream with PySpark on AWS Glue. These files have very small row-groups. I cannot control the files I'm starting with, but I want to combine row-groups so as to have "more efficient" files prior to downstream processing. (Why? These files will be uploaded to S3 and processed with Spark; my understanding is that Spark reads one row-group at a time, so many small row-groups mean more IO operations, which is inefficient. If this assumption is invalid, please educate me.)
Let's consider just one of these files for this question. It's compressed (with snappy compression) and 85MB on disk. When I inspect its schema using the pqrs tool, it reports that the file has 55,733 records in 1,115 row groups, and each row group seems to be around 500 kB - specifically, something like this:
row group 7:
--------------------------------------------------------------------------------
total byte size: 424752
num of rows: 50
If I simply take (1,115 row-groups * 500 kB/row-group) that's roughly 550MB, whereas the file on disk is 85MB. Granted, some of the row-groups are smaller than 500kB, but I eyeballed around 100 of them (half at the top, half at the bottom) and they're in that general ballpark.
Sub-question 1: is the difference (roughly 550MB calculated vs 85MB actual) because the row-group size reported by pqrs actually represents the uncompressed size - perhaps the in-memory size of the row-group, which would presumably be larger than the compressed, serialized size on disk? In other words, I can't do a simplistic 1,115 * 500 but have to apply some sort of compression factor?
Sub-question 2: when I see a recommended row-group size of 128MB, what exactly does that refer to? The uncompressed in-memory size? The serialized, compressed size on disk? Something else? How does it relate to what's reported by pqrs?
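For reference, here is how I imagine cross-checking this with pyarrow's metadata API - a sketch only, which assumes that RowGroupMetaData.total_byte_size is the uncompressed size and that the per-column total_compressed_size values add up to roughly the on-disk size (those assumptions are exactly what I'm unsure about); "input.parquet" is a placeholder for one of my files:

import pyarrow.parquet as pq

md = pq.ParquetFile("input.parquet").metadata  # placeholder path
for i in range(md.num_row_groups):
    rg = md.row_group(i)
    # size of the row group as pqrs seems to report it (uncompressed?)
    uncompressed = rg.total_byte_size
    # sum of the compressed column-chunk sizes (on-disk size?)
    compressed = sum(rg.column(j).total_compressed_size for j in range(rg.num_columns))
    print(f"row group {i}: rows={rg.num_rows}, "
          f"total_byte_size={uncompressed}, compressed={compressed}")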
My (simplified) code to compact these row-groups is:
import pyarrow.dataset as ds
import pyarrow.parquet as pq

def compact_parquet_in_batches(infile, outfile, batchsize):
    parquet_file = pq.ParquetFile(infile)
    ds.write_dataset(
        parquet_file.iter_batches(batch_size=batchsize),
        outfile,
        schema=RSCHEMA,  # pyarrow schema for these files, defined elsewhere
        format='parquet'
    )
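(As an aside: if I'm reading the docs right, recent pyarrow versions let write_dataset control output row-group sizes directly via min_rows_per_group / max_rows_per_group. I'm not sure whether that is the better knob here, but this sketch - with arbitrary parameter choices - shows what I mean:)

import pyarrow.dataset as ds
import pyarrow.parquet as pq

def compact_parquet_with_group_hints(infile, outfile, batchsize):
    parquet_file = pq.ParquetFile(infile)
    ds.write_dataset(
        parquet_file.iter_batches(batch_size=batchsize),
        outfile,
        schema=parquet_file.schema_arrow,  # take the schema from the input file
        format='parquet',
        min_rows_per_group=batchsize,      # buffer batches until at least this many rows
        max_rows_per_group=batchsize,      # cap each written row group at this many rows
    )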
Main question: What should batchsize be?
iter_batches takes batch_size as a number of records rather than a byte size. I could calculate it from the total record count and a desired number of batches, but I'm unclear what quantity I should be targeting.
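For example, if the 128MB target refers to uncompressed bytes, I imagine deriving the batch size from the file's metadata with a back-of-the-envelope calculation like the sketch below (it assumes total_byte_size is the uncompressed size, which is part of what I'm asking; "input.parquet" is again a placeholder):

import pyarrow.parquet as pq

md = pq.ParquetFile("input.parquet").metadata  # placeholder path
total_uncompressed = sum(md.row_group(i).total_byte_size for i in range(md.num_row_groups))
bytes_per_row = total_uncompressed / md.num_rows
batchsize = int(128 * 1024 * 1024 / bytes_per_row)  # rows per ~128MB (uncompressed?) batch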
What I actually tried was a simpler calculation based on the on-disk size:
- required # batches = file size on disk in MB / 128 = 85/128 = 1 (rounded up)
- batch size = # records / required # batches = 55,733 / 1 = 55,733, rounded up to 60,000 (next 10k)
When I run my code with a batch size of 60k:
- I get two row groups (great, 1,115 is down to 2; but why not 1?)
- the reported byte size of the first row group is around 250MB. So not only did it create twice the number of row-groups I expected; instead of each being half the size I expected, each is roughly double.
row group 0:
--------------------------------------------------------------------------------
total byte size: 262055359
num of rows: 32768
I figure some of my assumptions - or my understanding of the parquet file format, the pqrs tool or the pyarrow library - are off. Can someone please demystify this for me?