Spark with Avro, Kryo and Parquet

Asked 14/6, 2015 at 13:30 Answered 18/6, 2015 at 9:4

I'm struggling to understand what exactly Avro, Kryo and Parquet do in the context of Spark. They are all related to serialization, but I've seen them used together, so they can't all be doing the same thing.

Parquet describes itself as a columnar storage format, and I kind of get that, but when I'm saving a Parquet file can Avro or Kryo have anything to do with it? Or are they only relevant during the Spark job, i.e. for sending objects over the network during a shuffle or spilling to disk? How do Avro and Kryo differ, and what happens when you use them together?

Elissaelita answered 14/6, 2015 at 13:30 Comment(0)

Parquet works very well when your queries need to read only a few columns. However, if your schema has lots of columns (30+) and your queries/jobs need to read all of them, then record-based formats (like Avro) will work better/faster.
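To make the column-versus-record trade-off concrete, here is a toy sketch in plain Python. It is not how Parquet or Avro are implemented; the `rows` data, the `to_columns` helper and the field names are made up for illustration. It only shows that a columnar layout lets a query touch a single column, while a record layout has to walk every whole record.

```python
# Toy illustration only: real Parquet/Avro files add encoding,
# compression and metadata on top of these layouts.
rows = [
    {"user": "a", "clicks": 3, "country": "SE"},
    {"user": "b", "clicks": 5, "country": "DE"},
    {"user": "c", "clicks": 1, "country": "SE"},
]


def to_columns(records):
    """Pivot a list of records into one list per field (columnar layout)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Record layout: every whole record is visited to sum one field.
total_from_rows = sum(record["clicks"] for record in rows)

# Columnar layout: only the "clicks" column is read.
total_from_columns = sum(columns["clicks"])
```

Both totals are the same; the difference is how much data has to be read to get there, which is why the balance tips towards record formats when every column is needed anyway.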

Another limitation of Parquet is that it is essentially a write-once format. So usually you need to collect data in some staging area and write it to a Parquet file once a day (for example).

This is where you might want to use Avro. E.g. you can collect Avro-encoded records in a Kafka topic or local files and have a batch job that converts all of them to a Parquet file at the end of the day. This is fairly easy to implement thanks to the parquet-avro library, which provides tools to convert between the Avro and Parquet formats automatically.
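The shape of that workflow can be sketched with the standard library alone. This is a stand-in, not the parquet-avro API: JSON lines take the place of Avro records, an in-memory `io.StringIO` takes the place of the Kafka topic or local file, and a dict of lists takes the place of the Parquet file. The names `append_event` and `convert_batch` are hypothetical, and the sketch assumes every record has the same fields.

```python
import io
import json

# Stands in for a Kafka topic or a local file of Avro records.
staging = io.StringIO()


def append_event(event):
    """Append one record; nothing already written has to be touched."""
    staging.write(json.dumps(event) + "\n")


def convert_batch(log_text):
    """Read all staged records and pivot them into columns in one pass."""
    columns = {}
    for line in log_text.splitlines():
        for name, value in json.loads(line).items():
            columns.setdefault(name, []).append(value)
    return columns


# During the day: cheap, record-at-a-time appends.
append_event({"id": 1, "kind": "view"})
append_event({"id": 2, "kind": "click"})

# At the end of the day: one batch conversion to a columnar layout.
daily_columns = convert_batch(staging.getvalue())
```

The point of the split is that the record format absorbs the many small writes, while the columnar output is produced once per batch.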

And of course you can use Avro outside of Spark/big data. It is a fairly good serialization format, similar to Google Protobuf or Apache Thrift.

Bonnes answered 18/6, 2015 at 9:4 Comment(0)

This very good blog post explains the details for everything but Kryo.

http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/

Kryo would be used for fast serialization not involving permanent storage, such as shuffle data and cached data, in memory or on disk as temp files.
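As a rough sketch of how Kryo is switched on, Spark reads the serializer from its configuration. The two property names below are Spark configuration keys; the `kryo_settings` helper and the `64m` value are only illustrative, so check the configuration page for your Spark version before relying on them. Note that this affects how JVM objects are serialized for shuffles and caching, not the format of files you save.

```python
def kryo_settings(buffer_max="64m"):
    """Return Spark config pairs that select Kryo (hypothetical helper)."""
    return [
        # Replace the default Java serializer with Kryo.
        ("spark.serializer", "org.apache.spark.serializer.KryoSerializer"),
        # Upper bound on the Kryo buffer; raise it for very large objects.
        ("spark.kryoserializer.buffer.max", buffer_max),
    ]


settings = dict(kryo_settings())

# With PySpark installed, these pairs could be applied with
# SparkConf().setAll(kryo_settings()) before the SparkContext is created.
```

The same keys can also be passed on the command line with `--conf` or put in `spark-defaults.conf`.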

Entail answered 14/6, 2015 at 13:49 Comment(2)

So if Parquet is for efficient permanent storage and Kryo is for fast non-permanent storage, then what does Avro do? And when would I use it? – Elissaelita 14/6, 2015 at 14:43

Kryo is very fast and very compact, but it works only on the JVM, so it would limit our infrastructure to JVM applications. Maybe some crazy Node.js developer would also like to read our events? – Multiplicate 2/3, 2020 at 6:6
