Using AWS Firehose, I am converting incoming records to Parquet. In one example, 150k identical records enter Firehose and a single 30 KB Parquet file gets written to S3. Because of how Firehose partitions data, we have a secondary process (a Lambda triggered by the S3 put event) read in the Parquet file and repartition it based on the date within the event itself. After this repartitioning, the file size jumps from 30 KB to 900 KB.
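The Lambda wiring itself is just the standard S3 put-event handler; a minimal sketch (repartition here is a hypothetical stand-in for the pyarrow code shown further down):

import urllib.parse

def handler(event, context):
    # Standard S3 put-event notification: one record per object created
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        repartition(bucket, key)  # hypothetical wrapper around the pyarrow code below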
Inspecting both parquet files (a rough sketch of the inspection code is below this list):
- The metadata doesn't change
- The data doesn't change
- They both use SNAPPY compression
- The Firehose parquet is created by parquet-mr; the pyarrow-generated parquet is created by parquet-cpp
- The pyarrow-generated parquet has additional pandas headers
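The comparison above was done with something roughly like this (the file paths are placeholders, not the real keys):

import pyarrow.parquet as pq

def summarize(path):
    # Print the writer, row counts, and per-column compression/encodings/sizes
    meta = pq.ParquetFile(path).metadata
    print(path, '| created_by:', meta.created_by,
          '| row_groups:', meta.num_row_groups, '| rows:', meta.num_rows)
    for rg in range(meta.num_row_groups):
        for col in range(meta.num_columns):
            c = meta.row_group(rg).column(col)
            print('  ', c.path_in_schema, c.compression, c.encodings,
                  'compressed:', c.total_compressed_size,
                  'uncompressed:', c.total_uncompressed_size)

summarize('firehose.parquet')        # placeholder path for the parquet-mr file
summarize('repartitioned.parquet')   # placeholder path for the pyarrow output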
The full repartitioning process:
import pyarrow.parquet as pq

# Download the Firehose-written parquet from S3 to a local temp file
tmp_file = f'{TMP_DIR}/{rand_string()}'
s3_client.download_file(firehose_bucket, key, tmp_file)

# Read it into an Arrow table and rewrite it locally,
# partitioned on the date columns taken from the event itself
pq_table = pq.read_table(tmp_file)
pq.write_to_dataset(
    pq_table,
    local_partitioned_dir,
    partition_cols=['year', 'month', 'day', 'hour'],
    use_deprecated_int96_timestamps=True
)
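The partitioned files are then pushed back out to S3; the size check is just a walk over the output directory (the bucket name and upload layout here are illustrative, not my exact code):

import os

for root, _dirs, files in os.walk(local_partitioned_dir):
    for name in files:
        path = os.path.join(root, name)
        # This per-file size is where the 30 KB -> 900 KB jump shows up
        print(path, os.path.getsize(path), 'bytes')
        # Upload back to S3, preserving the hive-style partition directories
        s3_key = os.path.relpath(path, local_partitioned_dir)
        s3_client.upload_file(path, 'repartitioned-bucket', s3_key)  # placeholder bucket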
I imagine there would be some size change, but I was surprised to find such a big difference. Given the process I've described, what would cause the source parquet to go from 30 KB to 900 KB?