Is it better for Spark to select from hive or select from file
Asked Answered
G

2

11

I was just wondering what people's thoughts were on reading from Hive vs reading from a .csv file or a .txt file or an .ORC file, or a .parquet file. Assuming the underlying Hive table is an external table that has the same file format, would you rather read form a Hive table or from the underlying file itself, and why?

Mike

Gine answered 22/5, 2017 at 18:53 Comment(2)
You may find the most recent answer to #32373960 quite interesting >> vectorization is a real bonus... As well as "predicate push-down", "skip scans" based on local stats (both features require columnar formats), efficient partition pruning, appropriate compression...Aneroid
Good question Big MikeYapon
G
9

tl;dr : I would read it straight from the parquet files

I am using Spark 1.5.2 and Hive 1.2.1 For a 5Million row X 100 column table some timings I've recorded are

val dffile = sqlContext.read.parquet("/path/to/parquets/*.parquet")
val dfhive = sqlContext.table("db.table")

dffile count --> 0.38s; dfhive count --> 8.99s

dffile sum(col) --> 0.98s; dfhive sum(col) --> 8.10s

dffile substring(col) --> 2.63s; dfhive substring(col) --> 7.77s

dffile where(col=value) --> 82.59s; dfhive where(col=value) --> 157.64s

Note that these were done with an older version of Hive and an older version of Spark so I can't comment on how speed improvements could have occurred between the two reading mechanisms

Grassi answered 22/5, 2017 at 20:11 Comment(0)
C
1

From what I understand, even though in general .ORC is better suited for flat structures and parquet for nested ones, spark is optimised towards parquet. Therefore, it is advised to use that format with spark.

Furthermore, Metadata for all your read tables from parquet will be stored in hiveanyway. This is spark doc:Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table conversion is enabled, metadata of those converted tables are also cached. If these tables are updated by Hive or other external tools, you need to refresh them manually to ensure consistent metadata.

I tend to transform data asap into parquet format and store it alluxio backed by hdfs. This allows me to achieve better performance for read/write operations, and limit using cache.

I hope it helps.

Ciliolate answered 22/5, 2017 at 19:57 Comment(2)
That's cool that you are using Alluxio! But my question is specifically about Hive vs Non-Hive. Whether it's ORC, Parquet, whatever, that could be defined by something else. It might be a requirement to have ORC. I am just asking about Hive table vs just the file.Gine
Even though I think that Spark is still better with parquet, it has to be mentioned that Spark continuously works on its ORC-support which is available since Spark 2.3 (spark.apache.org/docs/latest/sql-data-sources-orc.html). So this might change in the future.Arnaud

© 2022 - 2024 — McMap. All rights reserved.