Our use case is a narrow table (15 fields) with heavy processing against the whole dataset (billions of rows). I am wondering which combination gives better performance (a sketch of both access paths is below):
env: CDH 5.8 / Spark 2.0
- Spark on Hive tables (stored as parquet)
- Spark on raw parquet files
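
For concreteness, here is a minimal Scala sketch of the two access paths, assuming a hypothetical Hive table `events` and a hypothetical HDFS directory holding the same parquet data; the grouping column is also made up:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-access-comparison")
  .enableHiveSupport() // required to resolve tables via the Hive metastore
  .getOrCreate()

// Option 1: read through the Hive metastore (table stored as parquet)
val viaHive = spark.table("events")

// Option 2: read the parquet files directly, bypassing the metastore
val viaFiles = spark.read.parquet("hdfs:///data/events/") // hypothetical path

// The same full-scan style aggregation can be run against either DataFrame
viaHive.groupBy("field1").count().show()
viaFiles.groupBy("field1").count().show()
```

In both cases Spark ends up with a DataFrame backed by the same parquet files, so the question is really about what the extra metastore hop and Hive table metadata cost or buy you at this scale.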