Write parquet file with Snappy compression in Apache Beam

I am trying to write a Parquet file in Apache Beam using Snappy compression, as follows:

records.apply(FileIO.<GenericRecord>write().via(ParquetIO.sink(schema)).to(options.getOutput()));

I see that it is possible to set AUTO, GZIP, BZIP2, ZIP, and DEFLATE as the compression, but I am unable to find a way to set it to SNAPPY. Any ideas how to do this? For reference, it is possible when writing to Avro, as follows:

records.apply("writeAvro", AvroIO.writeGenericRecords(schema).withCodec(CodecFactory.snappyCodec()).to(options.getOutput()));
Lillalillard answered 29/11, 2018 at 16:28

Good news! Soon after your question was asked, a withCompressionCodec(...) method was added to the ParquetIO sink. It is available from Apache Beam 2.11.0 onwards.
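
Adapting the snippet from your question, a minimal sketch of the Snappy configuration (the CompressionCodecName enum comes from org.apache.parquet.hadoop.metadata, which ships with the Parquet dependency):

records.apply(FileIO.<GenericRecord>write().via(ParquetIO.sink(schema).withCompressionCodec(CompressionCodecName.SNAPPY)).to(options.getOutput()));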

You may have been looking at the FileIO.Write abstract class, which allows you to use withCompression(Compression); that takes an enum that does not include SNAPPY. If it were used, it would compress the entire file with the specified compression type, which would be inappropriate for Parquet. The method above instead specifies how to compress the row groups inside the file.

Fortunately, ParquetIO prevents you from making this mistake: only the correct compression configuration method is exposed.

Kleenex answered 8/8, 2019 at 10:18
