Avro vs. Parquet

6

130

I'm planning to use one of the Hadoop file formats for my Hadoop-related project. I understand that Parquet is efficient for column-based queries, and Avro for full scans or when we need the data from all columns.

Before I proceed and choose one of the file formats, I want to understand the disadvantages/drawbacks of each relative to the other. Can anyone explain them to me in simple terms?

Mell answered 10/3, 2015 at 6:19 Comment(0)
S
71

Avro is a row-based format. If you want to retrieve the data as a whole, you can use Avro.

Parquet is a column-based format. If your data consists of a lot of columns but you are interested in only a subset of them, you can use Parquet.

HBase is useful when frequent updating of data is involved. Avro is fast in retrieval; Parquet is much faster when a query reads only a subset of the columns.
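
To make the row versus column distinction concrete, here is a minimal sketch in plain Java. It is a toy model only: the table, the class name `LayoutSketch` and its methods are made up for illustration, and real Avro and Parquet files add headers, encodings and compression on top of this.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: this is not how Avro or Parquet actually encode bytes.
class LayoutSketch {
    // Hypothetical table with 3 columns (id, name, score) and 3 rows.
    static final String[][] ROWS = {
        {"1", "ann", "90"},
        {"2", "bob", "85"},
        {"3", "cy", "70"},
    };

    // Row-oriented (Avro-like): all fields of one record sit next to each other.
    static List<String> rowLayout(String[][] rows) {
        List<String> out = new ArrayList<>();
        for (String[] row : rows) {
            for (String value : row) {
                out.add(value);
            }
        }
        return out;
    }

    // Column-oriented (Parquet-like): all values of one column sit next to each other.
    static List<String> columnLayout(String[][] rows) {
        List<String> out = new ArrayList<>();
        int columns = rows[0].length;
        for (int c = 0; c < columns; c++) {
            for (String[] row : rows) {
                out.add(row[c]);
            }
        }
        return out;
    }

    // Stored values a reader passes over to fetch a single column in each layout.
    static int valuesScannedRowLayout(String[][] rows) {
        return rows.length * rows[0].length; // must walk every record in full
    }

    static int valuesScannedColumnLayout(String[][] rows) {
        return rows.length; // only the one contiguous column is touched
    }

    public static void main(String[] args) {
        System.out.println("row layout:    " + rowLayout(ROWS));
        System.out.println("column layout: " + columnLayout(ROWS));
        System.out.println("values scanned for one column (row layout):    " + valuesScannedRowLayout(ROWS));
        System.out.println("values scanned for one column (column layout): " + valuesScannedColumnLayout(ROWS));
    }
}
```

In this model, reading one column out of three touches a third of the stored values in the column layout, while reading whole records is a single sequential pass in the row layout.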

Steric answered 29/1, 2016 at 0:45 Comment(1)
Parquet stores data on disk in a hybrid manner: it partitions the data horizontally and stores each partition in a columnar way. - Demonstrator
H
69

If you haven't already decided, I'd go ahead and write Avro schemas for your data. Once that's done, choosing between Avro container files and Parquet files is about as simple as swapping out e.g.,

job.setOutputFormatClass(AvroKeyOutputFormat.class);
AvroJob.setOutputKeySchema(job, MyAvroType.getClassSchema());

for

job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setSchema(job, MyAvroType.getClassSchema());

The Parquet format does seem to be a bit more computationally intensive on the write side (e.g., requiring RAM for buffering and CPU for ordering the data), but it should reduce I/O, storage and transfer costs, and make for efficient reads, especially with SQL-like queries (e.g., Hive or SparkSQL) that only address a portion of the columns.

In one project, I ended up reverting from Parquet to Avro containers because the schema was too extensive and nested (being derived from some fairly hierarchical object-oriented classes) and resulted in thousands of Parquet columns. In turn, our row groups were really wide and shallow, which meant that it took forever before we could process a small number of rows in the last column of each group.
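
A rough back-of-the-envelope sketch of why thousands of columns give "wide and shallow" row groups. All the numbers here are assumptions chosen for illustration (a 128 MiB row group budget, 8 bytes per value on average, and the class `RowGroupSketch` itself); they are not measurements from the project described above.

```java
// Back-of-the-envelope arithmetic only; real row group sizing also depends on
// encodings, compression and writer settings.
class RowGroupSketch {
    // Rows that fit into one row group when every column gets an equal share.
    static long rowsPerGroup(long rowGroupBytes, int columns, int avgBytesPerValue) {
        return rowGroupBytes / ((long) columns * avgBytesPerValue);
    }

    // Size of a single column's chunk inside that row group.
    static long bytesPerColumnChunk(long rows, int avgBytesPerValue) {
        return rows * avgBytesPerValue;
    }

    public static void main(String[] args) {
        long budget = 128L * 1024 * 1024; // assumed row group budget: 128 MiB
        int bytesPerValue = 8;            // assumed average encoded value size

        for (int columns : new int[] {10, 5000}) {
            long rows = rowsPerGroup(budget, columns, bytesPerValue);
            long chunk = bytesPerColumnChunk(rows, bytesPerValue);
            System.out.println(columns + " columns -> " + rows + " rows per group, "
                    + chunk + " bytes per column chunk");
        }
    }
}
```

Under these assumptions a 10-column table packs well over a million rows into each group, while a 5000-column table fits only a few thousand rows, so each column chunk is tiny and a writer has to buffer the whole wide group before any of it can be flushed.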

I haven't had much chance to use Parquet for more normalized/sane data yet but I understand that if used well, it allows for significant performance improvements.

Haigh answered 30/4, 2015 at 22:40 Comment(5)
Parquet supports nested datasets/collections too. - Charpentier
@Ruslan: Yes, it did technically support the nested structures. The problem was the very high number of columns due to extensive de-normalization of the data. It worked but it was very slow. - Haigh
Yes, writing data in Parquet is more expensive. Reads are the other way around, especially if your queries normally read a subset of columns. - Charpentier
I think Parquet is suitable for most use cases, except when the data in the same column varies a lot and the analysis almost always touches nearly all columns. - Uncial
Apache Arrow also does not yet support mixed nesting (lists with dictionaries or dictionaries with lists). So if you want to work with complex nesting in Parquet, you're stuck with Spark, Hive, etc., and other tools that don't rely on Arrow for reading and writing Parquet. - Nomination
A
59

Both Avro and Parquet are "self-describing" storage formats, meaning that both embed the data, its metadata and its schema when storing data in a file. Which storage format to use depends on the use case. Three aspects form the basis on which you may choose the optimal format for your case:

  1. Read/Write operation: Parquet is a column-based file format. It supports indexing. Because of that, it is suitable for write-once, read-intensive workloads: complex or analytical querying and low-latency data queries. It is generally used by end users/data scientists.
    Meanwhile Avro, being a row-based file format, is best used for write-intensive operations. It is generally used by data engineers. Both support serialization and compression, although they do so in different ways.

  2. Tools: Parquet is a good fit for Impala. (Impala is a Massively Parallel Processing (MPP) SQL query engine that knows how to operate on data residing in one or a few external storage engines.) Again, Parquet lends itself well to complex/interactive querying and fast (low-latency) outputs over data in HDFS. This is supported by CDH (Cloudera Distribution Hadoop). Hadoop supports Apache's Optimized Row Columnar (ORC) format (the selection depends on the Hadoop distribution), whereas Avro is best suited to Spark processing.

  3. Schema Evolution: Evolving a DB schema means changing the DB's structure, therefore its data, and thus its query processing.
    Both Parquet and Avro support schema evolution, but to varying degrees.
    Parquet is good for 'append' operations, e.g. adding columns, but not for renaming columns unless the 'read' is done by index.
    Avro is better suited than Parquet for appending, deleting and generally mutating columns. Historically Avro has provided a richer set of schema evolution possibilities than Parquet, and although the differences between their schema evolution capabilities are blurring, Avro still shines in that area when compared to Parquet.
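
The schema evolution point can be illustrated with a simplified model in plain Java. This is not the Avro or Parquet API: the class `SchemaEvolutionSketch`, its methods and the sample fields are hypothetical, and a "schema" here is just an ordered map from field name to default value. It shows why adding or dropping a field is easy when fields are matched by name with defaults, and why a rename only survives when fields are matched by position.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of schema resolution; not the Avro or Parquet API.
class SchemaEvolutionSketch {
    // Helper: build an ordered map from alternating name/value arguments.
    static Map<String, Object> fields(Object... nameValuePairs) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (int i = 0; i < nameValuePairs.length; i += 2) {
            out.put((String) nameValuePairs[i], nameValuePairs[i + 1]);
        }
        return out;
    }

    // Match by name: use the written value if present, otherwise the reader's default.
    static Map<String, Object> resolveByName(Map<String, Object> written,
                                             Map<String, Object> readerDefaults) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerDefaults.entrySet()) {
            String name = field.getKey();
            out.put(name, written.containsKey(name) ? written.get(name) : field.getValue());
        }
        return out;
    }

    // Match by position: reader field i takes the i-th written value, whatever its name.
    static Map<String, Object> resolveByIndex(Object[] writtenValues, String[] readerNames) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (int i = 0; i < readerNames.length; i++) {
            out.put(readerNames[i], i < writtenValues.length ? writtenValues[i] : null);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> written = fields("id", 1, "name", "ann");

        // Added field: filled from the reader's default.
        System.out.println(resolveByName(written, fields("id", 0, "name", "", "country", "unknown")));
        // Dropped field: simply ignored.
        System.out.println(resolveByName(written, fields("id", 0)));
        // Renamed field: lost when matching by name, preserved when matching by position.
        System.out.println(resolveByName(written, fields("id", 0, "full_name", "")));
        System.out.println(resolveByIndex(new Object[] {1, "ann"}, new String[] {"id", "full_name"}));
    }
}
```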

Applecart answered 2/2, 2018 at 10:2 Comment(2)
The "Tools" part is a bit misleading. Parquet is efficiently used by lots of other frameworks like Spark, Presto, Hive etc. Avro is not specific to Spark; it is widely used as an HDFS storage format and in message-passing scenarios like Kafka. - Griffe
Aakash Aggarwal: Can you explain what you mean in paragraph 2 by "Avro is best fit for Spark processing"? As mentioned by devrimbaris, Parquet is very well integrated in the Spark processing environment as well. o_O ?!? - Chicago
A
58

Avro

  • Widely used as a serialization platform
  • Row-based, offers a compact and fast binary format
  • Schema is encoded in the file so the data can be untagged
  • Files support block compression and are splittable
  • Supports schema evolution

Parquet

  • Column-oriented binary file format
  • Uses the record shredding and assembly algorithm described in the Dremel paper
  • Each data file contains the values for a set of rows
  • Efficient in terms of disk I/O when specific columns need to be queried

From Choosing an HDFS data storage format - Avro vs. Parquet and more

Acaroid answered 17/9, 2017 at 6:2 Comment(0)
U
15

Your understanding is right. In fact, we ran into a similar situation during data migration in our DWH. We chose Parquet over Avro because the disk savings were almost double what we got with Avro. The query processing time was also much better than with Avro. But yes, our queries were based on aggregation, column-based operations, etc., hence Parquet was predictably a clear winner.
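
A toy illustration of one reason columnar layouts often compress well: values from the same column sit next to each other, so run-length style encodings find long runs. This sketch does not reproduce the disk savings mentioned above; the sample data and the class `RunLengthSketch` are made up, and real files combine several encodings with a general-purpose compressor.

```java
// Toy illustration: count run-length runs for the same data in two orderings.
class RunLengthSketch {
    // Hypothetical table with two low-cardinality columns: country and status.
    static final String[][] ROWS = {
        {"US", "ok"}, {"US", "ok"}, {"US", "ok"},
        {"DE", "ok"}, {"DE", "ok"}, {"DE", "err"},
    };

    // Values in the order a row-oriented file would store them.
    static String[] rowOrder(String[][] rows) {
        int columns = rows[0].length;
        String[] out = new String[rows.length * columns];
        int i = 0;
        for (String[] row : rows) {
            for (String value : row) {
                out[i++] = value;
            }
        }
        return out;
    }

    // Values in the order a column-oriented file would store them.
    static String[] columnOrder(String[][] rows) {
        int columns = rows[0].length;
        String[] out = new String[rows.length * columns];
        int i = 0;
        for (int c = 0; c < columns; c++) {
            for (String[] row : rows) {
                out[i++] = row[c];
            }
        }
        return out;
    }

    // Number of (value, count) pairs a run-length encoder would emit.
    static int countRuns(String[] values) {
        if (values.length == 0) {
            return 0;
        }
        int runs = 1;
        for (int i = 1; i < values.length; i++) {
            if (!values[i].equals(values[i - 1])) {
                runs++;
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println("runs in row order:    " + countRuns(rowOrder(ROWS)));
        System.out.println("runs in column order: " + countRuns(columnOrder(ROWS)));
    }
}
```

Fewer runs means fewer (value, count) pairs to store, which is the effect that aggregation-heavy, column-oriented workloads tend to benefit from.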

We are using Hive 0.12 from CDH distro. You mentioned you are running into issues with Hive+Parquet, what are those? We did not encounter any.

Uvula answered 27/6, 2015 at 20:17 Comment(0)
P
5

Silver Blaze described it nicely with an example use case and explained how Parquet was the best choice for him. It makes sense to consider one over the other depending on your requirements. I am also adding a brief description of other file formats, along with a comparison of time and space complexity. Hope that helps.

There are a bunch of file formats that you can use in Hive. Notable mentions are Avro, Parquet, RCFile and ORC. There are some good documents available online that you may refer to if you want to compare the performance and space utilization of these file formats. Here are some useful links to get you started.

This Blog Post

This link from MapR [They don't discuss Parquet though]

This link from Inquidia

The links above will get you going. I hope this answers your query.

Thanks!

Puerperium answered 3/9, 2015 at 6:38 Comment(0)
