How to find the COMPRESSION_CODEC used on a Parquet file at the time of its generation?

Usually in Impala, we set the COMPRESSION_CODEC query option before inserting data into a table whose underlying files are in Parquet format.

Commands used to set COMPRESSION_CODEC:

set compression_codec=snappy;
set compression_codec=gzip;

Is it possible to find out which compression codec was used by performing some operation on the Parquet file itself?
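For what it's worth, the codec is a write-time property outside Impala too. A minimal pyarrow sketch of the equivalent choice at file-generation time (the column names and file names here are made up for illustration):

import pyarrow as pa
import pyarrow.parquet as pq

# A tiny throwaway table; the columns are hypothetical.
table = pa.table({"code": ["a", "b"], "value": [1, 2]})

# The codec is fixed when the file is written, just as Impala's
# COMPRESSION_CODEC option fixes it at insert time.
pq.write_table(table, "example.snappy.parquet", compression="snappy")
pq.write_table(table, "example.gz.parquet", compression="gzip")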

Bordure asked 20/8, 2019 at 12:16 Comment(0)

One way to find the compression algorithm used by an Impala Parquet table is via parquet-tools. This utility comes packaged with Cloudera CDH, for example, and can otherwise be built trivially from source.

$ parquet-tools meta <parquet-file>
creator:     impala version 2.13.0-SNAPSHOT (build 100d7da677f2c81efa6af2a5e3a2240199ae54d5)

file schema: schema
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
code:        OPTIONAL BINARY R:0 D:1
description: OPTIONAL BINARY R:0 D:1
value:       OPTIONAL INT32 O:INT_32 R:0 D:1

row group 1: RC:823 TS:20420
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
code:         BINARY GZIP DO:4 FPO:1727 SZ:2806/10130/3.61 VC:823 ENC:RLE,PLAIN_DICTIONARY
description:  BINARY GZIP DO:2884 FPO:12616 SZ:10815/32928/3.04 VC:823 ENC:RLE,PLAIN_DICTIONARY
value:        INT32 GZIP DO:17462 FPO:19614 SZ:3241/4130/1.27 VC:823 ENC:RLE,PLAIN_DICTIONARY

Since in Parquet generally (though not through Impala) compression can be set column by column, you will see the compression codec listed against each column's stats for every row group.
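If you would rather check programmatically than parse parquet-tools output, the same per-column, per-row-group detail is exposed by pyarrow; a minimal sketch, assuming a local file (the file name is hypothetical):

import pyarrow.parquet as pq

# Only the footer metadata is read here, not the data pages.
meta = pq.ParquetFile("example.parquet").metadata

# Compression is recorded per column chunk, so walk every
# row group and every column within it.
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        print(rg, chunk.path_in_schema, chunk.compression)

Each line prints something like "0 code GZIP", matching the codecs shown against the columns in the dump above.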

Horacehoracio answered 20/8, 2019 at 15:9 Comment(2)
Newer parquet-tools releases require the parquet-tools inspect path/to/parquet/file command. – Apomixis
Cool! Perhaps you can add a new answer, complete with reference and example? – Horacehoracio

The pip-installable parquet-tools package is quite good.

To use:

  1. Install using pip install parquet-tools
  2. Inspect the file using parquet-tools inspect s3://path/to/parquet

https://pypi.org/project/parquet-tools/#description
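Since an Impala table usually maps to a whole directory of Parquet files, it can also be handy to confirm the codec across all of them at once. A small pyarrow sketch along the same lines (the directory path and the .parq glob pattern are assumptions about your layout):

import pathlib
import pyarrow.parquet as pq

# Collect every codec used by any column chunk in any file
# under the table directory; a consistent table yields one.
codecs = set()
for path in pathlib.Path("/data/my_table").glob("*.parq*"):
    meta = pq.read_metadata(path)
    for rg in range(meta.num_row_groups):
        for col in range(meta.num_columns):
            codecs.add(meta.row_group(rg).column(col).compression)
print(codecs)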

Apomixis answered 9/1 at 0:20 Comment(0)

You might also want to run this against HDFS directly:

hadoop jar parquet-tools-1.11.1.jar meta hdfs://a.parquet
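If you have pyarrow built with HDFS support (libhdfs), the footer can also be read without the jar; a sketch, assuming a default-configured client and a hypothetical path:

import pyarrow.fs as fs
import pyarrow.parquet as pq

# Connect using the cluster's default HDFS configuration.
hdfs = fs.HadoopFileSystem("default")

# Read only the footer metadata of the remote file.
with hdfs.open_input_file("/user/me/a.parquet") as f:
    meta = pq.read_metadata(f)
print(meta.row_group(0).column(0).compression)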
Biogeochemistry answered 24/5, 2022 at 3:16 Comment(0)