I decided to use Parquet as the storage format for my Hive tables, and before actually implementing it on my cluster I ran some tests. Surprisingly, Parquet was slower in my tests, contrary to the general notion that it is faster than plain text files.
Please note that I am using Hive 0.13 on MapR.
----------------------------------------------------------
|             | Table A | Table B | Table C |            |
----------------------------------------------------------
| Format      | Text    | Parquet | Parquet |            |
| Size [GB]   | 2.5     | 1.9     | 1.9     |            |
| Compression | N/A     | N/A     | Snappy  |            |
| CPU [sec]   | 123.33  | 204.92  | N/A     | Operation1 |
| Time [sec]  | 59.057  | 50.33   | N/A     | Operation1 |
| CPU [sec]   | 51.18   | 117.08  | N/A     | Operation2 |
| Time [sec]  | 25.296  | 27.448  | N/A     | Operation2 |
| CPU [sec]   | 57.55   | 113.97  | N/A     | Operation3 |
| Time [sec]  | 20.254  | 27.678  | N/A     | Operation3 |
| CPU [sec]   | 57.55   | 113.97  | N/A     | Operation4 |
| Time [sec]  | 20.254  | 27.678  | N/A     | Operation4 |
| CPU [sec]   | 127.85  | 255.2   | N/A     | Operation5 |
| Time [sec]  | 29.68   | 41.025  | N/A     | Operation5 |
----------------------------------------------------------
- Operation1: Row count operation
- Operation2: Single Row Selection
- Operation3: Multi Row Selection Using Where clause [1000 rows fetched]
- Operation4: Multi Row Selection [with only 4 columns] Using Where clause [1000 rows fetched]
- Operation5: Aggregation operation [Using sum function on a given column]
As you can see, in almost every operation I ran on both tables, Parquet took longer to execute the query; the row count operation is the only exception.
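To put a number on that gap, here is a minimal Python sketch that computes the Parquet-to-Text ratio for each operation. The timings are copied from the table for Table A (Text) and Table B (Parquet); the dictionary and function names are only illustrative.

```python
# Timings copied from the table: (text, parquet) per operation, in seconds.
cpu_sec = {
    "Operation1": (123.33, 204.92),
    "Operation2": (51.18, 117.08),
    "Operation3": (57.55, 113.97),
    "Operation4": (57.55, 113.97),
    "Operation5": (127.85, 255.2),
}
time_sec = {
    "Operation1": (59.057, 50.33),
    "Operation2": (25.296, 27.448),
    "Operation3": (20.254, 27.678),
    "Operation4": (20.254, 27.678),
    "Operation5": (29.68, 41.025),
}


def parquet_to_text_ratio(timings):
    """Return {operation: parquet / text}; a value above 1.0 means Parquet was slower."""
    return {op: parquet / text for op, (text, parquet) in timings.items()}


cpu_ratio = parquet_to_text_ratio(cpu_sec)
time_ratio = parquet_to_text_ratio(time_sec)

for op in cpu_sec:
    print(f"{op}: CPU x{cpu_ratio[op]:.2f}, wall-clock x{time_ratio[op]:.2f}")
```

On these numbers the wall-clock ratio is below 1.0 only for Operation1, while the CPU ratio is above 1.0 for every operation, roughly 1.7 to 2.3 times the Text figure.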
I also ran the same operations on Table C, but the results were along similar lines: the TextFile format was again the faster of the two.
Can someone please tell me what I am doing wrong?
Thanks!
EDIT
I added ORC to the list of storage formats and ran the tests again. The details follow.
Row count operation
Text Format Cumulative CPU - 123.33 sec
Parquet Format Cumulative CPU - 204.92 sec
ORC Format Cumulative CPU - 119.99 sec
ORC with SNAPPY Cumulative CPU - 107.05 sec
Sum of a column operation
Text Format Cumulative CPU - 127.85 sec
Parquet Format Cumulative CPU - 255.2 sec
ORC Format Cumulative CPU - 120.48 sec
ORC with SNAPPY Cumulative CPU - 98.27 sec
Average of a column operation
Text Format Cumulative CPU - 128.79 sec
Parquet Format Cumulative CPU - 211.73 sec
ORC Format Cumulative CPU - 165.5 sec
ORC with SNAPPY Cumulative CPU - 135.45 sec
Selecting 4 columns from a given range using where clause
Text Format Cumulative CPU - 72.48 sec
Parquet Format Cumulative CPU - 136.4 sec
ORC Format Cumulative CPU - 96.63 sec
ORC with SNAPPY Cumulative CPU - 82.05 sec
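To summarise the four sets of CPU figures, a small Python sketch can rank the formats per operation. The values are copied from the measurements listed here; the short operation labels, the "ORC+Snappy" key, and the function name are just shorthand chosen for this example.

```python
# Cumulative CPU seconds copied from the measurements in this edit.
cpu_sec = {
    "row count": {"Text": 123.33, "Parquet": 204.92, "ORC": 119.99, "ORC+Snappy": 107.05},
    "sum": {"Text": 127.85, "Parquet": 255.2, "ORC": 120.48, "ORC+Snappy": 98.27},
    "average": {"Text": 128.79, "Parquet": 211.73, "ORC": 165.5, "ORC+Snappy": 135.45},
    "select 4 columns": {"Text": 72.48, "Parquet": 136.4, "ORC": 96.63, "ORC+Snappy": 82.05},
}


def rank_formats(by_format):
    """Return the format names ordered from lowest to highest CPU time."""
    return sorted(by_format, key=by_format.get)


rankings = {op: rank_formats(by_format) for op, by_format in cpu_sec.items()}

for op, order in rankings.items():
    print(f"{op}: " + " < ".join(order))
```

On these numbers Parquet has the highest cumulative CPU in all four operations, ORC with Snappy has the lowest for the row count and the sum, and Text has the lowest for the average and the 4-column selection. Cumulative CPU is not the same as elapsed time (in the first table, Parquet used more CPU for the row count yet finished sooner), so this ranking alone does not show which format returns results fastest.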
Does that mean ORC is faster than Parquet? Or is there something I can do to make Parquet perform better in terms of query response time and compression ratio?
Thanks!