Spark on Parquet vs Spark on Hive (Parquet format)

Our use case is a narrow table (15 fields) with heavy processing against the whole dataset (billions of rows). I am wondering which combination provides better performance (both options are sketched below):

Env: CDH 5.8 / Spark 2.0

  1. Spark on Hive tables (stored as Parquet)
  2. Spark on row files (Parquet)
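For reference, a minimal sketch of what the two options look like from the Spark side (the table name and HDFS path are hypothetical; everything downstream of the resulting DataFrame is identical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-vs-hive")
  .enableHiveSupport() // only needed for option 1
  .getOrCreate()

// Option 1: Spark on a Hive table (stored as Parquet)
val viaHive = spark.table("mydb.events")

// Option 2: Spark on plain Parquet files
val viaFiles = spark.read.parquet("hdfs:///data/events")
```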
Weighty answered 9/11, 2017 at 17:21 Comment(2)
There are known issues with Scala lambdas being slower than SparkSQL expressions (which use scalar types directly, with no round-trip to objects), but it's usually marginal. And the ORC vectorized reader is scheduled for Spark 2.3 if I remember correctly, while Parquet already has vectorization support. Other than that... I'm an old SQL user who finds Scala portmanteau expressions ridiculous, like so many sausage strings, but that's my personal opinion (set-based semantics, baby!)Kugler
SparkSQL on row files (Parquet or ORC): what do you mean by row files? ORC is columnar storage, right?Donar

There are only two options here: Spark on files, or Spark on Hive. SparkSQL works with both, and you should prefer the Dataset API over RDDs.

If you can define the Dataset schema yourself, Spark reading the raw HDFS files will be faster because you bypass the extra hop to the Hive Metastore.
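A minimal sketch of that, with hypothetical column names trimmed from the 15-field table: supplying the schema up front means Spark makes no Metastore call and also skips the schema-inference pass over the Parquet footers.

```scala
import org.apache.spark.sql.types._

// Hypothetical subset of the 15-field schema.
val schema = StructType(Seq(
  StructField("event_id",   LongType,      nullable = false),
  StructField("event_time", TimestampType, nullable = true),
  StructField("user_id",    LongType,      nullable = true),
  StructField("amount",     DoubleType,    nullable = true)
))

// No Metastore lookup, no footer-based schema inference.
val events = spark.read.schema(schema).parquet("hdfs:///data/events")
```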

When I ran a simple test myself years ago (with Spark 1.3), I noticed that extracting 100,000 rows as a CSV file was orders of magnitude faster than a SparkSQL Hive query with the same LIMIT.

Flavorsome answered 10/11, 2017 at 6:34 Comment(0)

Without additional context about your specific product and use case, I'd vote for SparkSQL on Hive tables, for two reasons:

  1. SparkSQL rather than core Spark: SparkSQL is usually better than core Spark because Databricks built various optimizations into SparkSQL, which is a higher abstraction and gives Spark the ability to optimize your code (read about Project Tungsten). In some cases hand-written Spark core code will be better, but that demands a deep understanding of the internals from the programmer. In addition, SparkSQL is sometimes limited and doesn't let you control low-level mechanisms, but you can always fall back to working with core RDDs (see the sketch after this list).

  2. Hive rather than files: I'm assuming Hive with an external metastore. The metastore stores the partition definitions of your "tables" (with plain files, that would just be some directory layout). This is one of the most important factors for good performance: when working with files, Spark has to discover this information itself, which can be time-consuming (an S3 list operation, for example, is very slow), whereas the metastore lets Spark fetch it in a simple and fast way.
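To illustrate the first point, a rough sketch assuming a Hive table mydb.events with hypothetical user_id and amount columns: the DataFrame version goes through Catalyst and Tungsten, which can push the filter down into the scan and generate code for the aggregation, while the RDD fallback gives you full control but runs opaque lambdas over deserialized Row objects that the optimizer cannot see into.

```scala
import org.apache.spark.sql.functions._

val events = spark.table("mydb.events")

// Optimized path: Catalyst plans the filter + aggregation, Tungsten executes it.
val totalsDf = events
  .filter(col("amount") > 0)
  .groupBy(col("user_id"))
  .agg(sum("amount").as("total"))

// Fallback path: core RDD with lambdas that are a black box to the optimizer.
val totalsRdd = events.rdd
  .filter(_.getAs[Double]("amount") > 0)
  .map(r => (r.getAs[Long]("user_id"), r.getAs[Double]("amount")))
  .reduceByKey(_ + _)
```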

Hiroko answered 9/11, 2017 at 18:58 Comment(0)
