Why is querying Parquet files slower than text files in Hive?

I decided to use Parquet as the storage format for Hive tables, and before actually implementing it in my cluster I decided to run some tests. Surprisingly, Parquet was slower in my tests, contrary to the general notion that it is faster than plain text files.

Please note that I am using Hive 0.13 on MapR.

----------------------------------------------------------
|             | Table A | Table B | Table C |            |
----------------------------------------------------------
| Format      | Text    | Parquet | Parquet |            |
| Size [GB]   | 2.5     | 1.9     | 1.9     |            |
| Compression | N/A     | N/A     | Snappy  |            |
| CPU [sec]   | 123.33  | 204.92  | N/A     | Operation1 |
| Time [sec]  | 59.057  | 50.33   | N/A     | Operation1 |
| CPU [sec]   | 51.18   | 117.08  | N/A     | Operation2 |
| Time [sec]  | 25.296  | 27.448  | N/A     | Operation2 |
| CPU [sec]   | 57.55   | 113.97  | N/A     | Operation3 |
| Time [sec]  | 20.254  | 27.678  | N/A     | Operation3 |
| CPU [sec]   | 57.55   | 113.97  | N/A     | Operation4 |
| Time [sec]  | 20.254  | 27.678  | N/A     | Operation4 |
| CPU [sec]   | 127.85  | 255.2   | N/A     | Operation5 |
| Time [sec]  | 29.68   | 41.025  | N/A     | Operation5 |
----------------------------------------------------------
  • Operation1: Row count operation
  • Operation2: Single Row Selection
  • Operation3: Multi Row Selection Using Where clause [1000 rows fetched]
  • Operation4: Multi Row Selection [with only 4 columns] Using Where clause [1000 rows fetched]
  • Operation5: Aggregation operation [Using sum function on a given column]
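The exact DDL and queries were not posted, so here is a minimal sketch of the kind of setup and operations described above, assuming a hypothetical table with columns id, col1..col4 (all table names, column names, and the delimiter are placeholders, not the actual benchmark code):

    -- Hypothetical schema and delimiter; the real DDL was not posted.
    CREATE TABLE events_text (id BIGINT, col1 STRING, col2 STRING, col3 DOUBLE, col4 INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- Native Parquet support (STORED AS PARQUET) is available from Hive 0.13.
    CREATE TABLE events_parquet (id BIGINT, col1 STRING, col2 STRING, col3 DOUBLE, col4 INT)
    STORED AS PARQUET;

    -- Copy the data so both tables hold identical rows.
    INSERT OVERWRITE TABLE events_parquet SELECT * FROM events_text;

    -- Operation1: row count
    SELECT COUNT(*) FROM events_parquet;
    -- Operation2: single row selection
    SELECT * FROM events_parquet WHERE id = 12345;
    -- Operation3: multi-row selection using a WHERE clause
    SELECT * FROM events_parquet WHERE col4 BETWEEN 100 AND 200;
    -- Operation4: same selection, but projecting only 4 columns
    SELECT id, col1, col2, col3 FROM events_parquet WHERE col4 BETWEEN 100 AND 200;
    -- Operation5: aggregation on a given column
    SELECT SUM(col3) FROM events_parquet;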

As the table shows, in almost every operation I ran against both tables Parquet lags behind in query execution time, with the exception of the row count operation.

I also ran the aforementioned operations against table C, but the results were along similar lines: the TextFile format was again the faster of the two.

Can someone please let me know what I am doing wrong?

Thanks!

EDIT

I added ORC to the list of storage formats and ran the tests again. The details follow.

---------------------------------------------------------------------------------------------
| Cumulative CPU [sec]                           | Text   | Parquet | ORC    | ORC + Snappy |
---------------------------------------------------------------------------------------------
| Row count                                      | 123.33 | 204.92  | 119.99 | 107.05       |
| Sum of a column                                | 127.85 | 255.2   | 120.48 | 98.27        |
| Average of a column                            | 128.79 | 211.73  | 165.5  | 135.45       |
| Select 4 columns from a range (WHERE clause)   | 72.48  | 136.4   | 96.63  | 82.05        |
---------------------------------------------------------------------------------------------
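For reference, the ORC variants can be declared in the usual Hive 0.13 way; a minimal sketch, again with placeholder table and column names, where Snappy is selected through a table property:

    -- ORC with default (ZLIB) compression.
    CREATE TABLE events_orc (id BIGINT, col1 STRING, col2 STRING, col3 DOUBLE, col4 INT)
    STORED AS ORC;

    -- ORC with Snappy compression selected via a table property.
    CREATE TABLE events_orc_snappy (id BIGINT, col1 STRING, col2 STRING, col3 DOUBLE, col4 INT)
    STORED AS ORC
    TBLPROPERTIES ("orc.compress"="SNAPPY");

    -- Load the same data into both ORC tables.
    INSERT OVERWRITE TABLE events_orc SELECT * FROM events_text;
    INSERT OVERWRITE TABLE events_orc_snappy SELECT * FROM events_text;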

Does that mean ORC is faster than Parquet? Or is there something I can do to make Parquet work better in terms of query response time and compression ratio?

Thanks!

Thilde answered 2/9, 2015 at 10:25 Comment(8)
Just out of curiosity: (1) did you try to pick a few columns instead of selecting all? Columnar storage is not that good at rebuilding "fat" rows out of column shreds. (2) did you consider ORC (with fast stripe elimination, vectorized reads, etc.) as an alternative format with better support in Hive? – Reentry
Samson, here is my response to your queries. #1. Yes, I did fetch a few columns from the table; the cumulative CPU was still higher for Parquet [you may check the results posted in the question]. #2. I worked on ORC after posting my question here. It takes far less space (652 MB) and less time than Parquet too. I will edit my question and post the complete results from the ORC table. – Thilde
Just out of curiosity, I was not able to find a good paper comparing Parquet and ORC neck and neck. Do you have any document that compares these two file formats by use case? Also, the table that I am using here is not that wide and has 12-15 columns. Is Parquet a good option here? – Thilde
The ugly truth is that Cloudera is pushing Parquet+Impala while HortonWorks is pushing ORC+Hive. Lots of marketing and politics involved... But there are some format-agnostic tools like Presto, e.g. zdnet.com/article/… (caveat: perf may change drastically with new versions and with config tuning) – Reentry
What is surprising to me is the time it is taking you. What hardware are you using? And always post code for benchmarks. – Pediatrics
@SamsonScharfrichter Interesting... I think the cool kids are using Spark+Parquet. – Pediatrics
There's not enough information to answer: in particular, there's no information on the schemas of the tables you used, nor on the statistics of the data. Where Parquet excels is tables with hundreds of columns, sparsely populated (storing nulls is basically free), where values are somewhat repeated. If you did your tests with randomly generated strings, for example, and with just a few columns, the test would not be significant. – Shafting
I also have a similar situation here. – Lifesaving

First, I would like to point out that it is virtually impossible to answer your question with the given details.

A few points:

  • measuring time in a distributed environment is not the way to determine whether something is slow (if you have many queries running and competing for resources, you are not measuring what you think you are measuring)

  • not providing the actual table definition and the queries running against those tables makes this problem impossible to reproduce

  • not providing the number of rows in the table and the cardinality of its individual fields is also not helping

In general, querying Parquet is much faster than querying text files because Parquet employs many techniques to make read operations faster. A few of these:

  • compression
  • run length encoding
  • dictionary encoding

Depending on the use case, some of these parameters can be tuned to match it exactly.
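For example, in Hive the Parquet compression codec, row-group size, and dictionary encoding can be influenced when the data is written. A minimal sketch using commonly documented Parquet settings (the table names are placeholders, and which settings are honoured can depend on the Hive/Parquet version, so verify them for your build):

    -- Compression codec used when writing Parquet files.
    SET parquet.compression=SNAPPY;
    -- Row-group ("block") size in bytes; larger groups favour full scans,
    -- smaller ones favour more selective reads.
    SET parquet.block.size=134217728;
    -- Dictionary encoding helps columns with many repeated values.
    SET parquet.enable.dictionary=true;

    -- Rewrite the table so the new settings apply to the stored files
    -- (events_parquet / events_text are hypothetical names).
    INSERT OVERWRITE TABLE events_parquet SELECT * FROM events_text;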

Purdah answered 16/11, 2020 at 14:8 Comment(0)
