From what I understand, even though ORC is generally better suited for flat structures and Parquet for nested ones, Spark is optimised for Parquet, so that is the recommended format to use with Spark.
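For example, converting raw data to Parquet and reading it back is just a call on the DataFrame reader/writer. A minimal sketch; the app name and paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ParquetExample").getOrCreate()

// Read some raw input (placeholder path) and persist it as Parquet.
val df = spark.read.json("hdfs:///data/raw/events.json")
df.write.mode("overwrite").parquet("hdfs:///data/parquet/events")

// Reading Parquet back lets Spark benefit from column pruning and predicate pushdown.
val events = spark.read.parquet("hdfs:///data/parquet/events")
events.printSchema()
```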
Furthermore, the metadata for all the Parquet tables you read will end up in the Hive metastore anyway. From the Spark documentation:

"Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table conversion is enabled, metadata of those converted tables are also cached. If these tables are updated by Hive or other external tools, you need to refresh them manually to ensure consistent metadata."
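In practice that means that if Hive or another external tool rewrites a table's files, you refresh it from Spark before querying. Something along these lines, reusing the session from above (the table name is a placeholder):

```scala
// Refresh Spark's cached metadata so the next query sees a consistent view.
spark.catalog.refreshTable("my_db.my_parquet_table")

// The same thing in SQL:
spark.sql("REFRESH TABLE my_db.my_parquet_table")
```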
I tend to transform data into Parquet as soon as possible and store it in Alluxio backed by HDFS. This gives me better read/write performance and limits how much I need to rely on caching in Spark.
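Writing to Alluxio looks exactly like writing to HDFS, only the URI scheme and master address change. A rough sketch, assuming the Alluxio client is on Spark's classpath; the host, port and path are placeholders:

```scala
// Same DataFrame writer, just pointed at an Alluxio path; Alluxio persists
// the files to the underlying HDFS according to its configured write type.
df.write.mode("overwrite").parquet("alluxio://alluxio-master:19998/data/events_parquet")

// Subsequent reads are served from Alluxio's memory tier when the data is cached there.
val eventsFromAlluxio = spark.read.parquet("alluxio://alluxio-master:19998/data/events_parquet")
```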
I hope it helps.