Why is querying Parquet files slower than text files in Hive?

I decided to use Parquet as the storage format for Hive tables, and before actually implementing it in my cluster I decided to run some tests. Surprisingly, Parquet was slower in my tests, contrary to the general notion that it is faster than plain text files.

Please note that I am using Hive 0.13 on MapR.

----------------------------------------------------------
|             | Table A | Table B | Table C |            |
----------------------------------------------------------
| Format      | Text    | Parquet | Parquet |            |
| Size [GB]   | 2.5     | 1.9     | 1.9     |            |
| Compression | N/A     | N/A     | Snappy  |            |
| CPU [sec]   | 123.33  | 204.92  | N/A     | Operation1 |
| Time [sec]  | 59.057  | 50.33   | N/A     | Operation1 |
| CPU [sec]   | 51.18   | 117.08  | N/A     | Operation2 |
| Time [sec]  | 25.296  | 27.448  | N/A     | Operation2 |
| CPU [sec]   | 57.55   | 113.97  | N/A     | Operation3 |
| Time [sec]  | 20.254  | 27.678  | N/A     | Operation3 |
| CPU [sec]   | 57.55   | 113.97  | N/A     | Operation4 |
| Time [sec]  | 20.254  | 27.678  | N/A     | Operation4 |
| CPU [sec]   | 127.85  | 255.2   | N/A     | Operation5 |
| Time [sec]  | 29.68   | 41.025  | N/A     | Operation5 |
----------------------------------------------------------
  • Operation1: Row count operation
  • Operation2: Single Row Selection
  • Operation3: Multi Row Selection Using Where clause [1000 rows fetched]
  • Operation4: Multi Row Selection [with only 4 columns] Using Where clause [1000 rows fetched]
  • Operation5: Aggregation operation [Using sum function on a given column]
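The exact DDL and queries were not posted, so here is a minimal sketch of the kind of setup and operations described above, assuming a hypothetical table with columns id, col1..col4 (all table names, column names, and the delimiter are placeholders, not the actual benchmark code):

    -- Hypothetical schema and delimiter; the real DDL was not posted.
    CREATE TABLE events_text (id BIGINT, col1 STRING, col2 STRING, col3 DOUBLE, col4 INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- Native Parquet support (STORED AS PARQUET) is available from Hive 0.13.
    CREATE TABLE events_parquet (id BIGINT, col1 STRING, col2 STRING, col3 DOUBLE, col4 INT)
    STORED AS PARQUET;

    -- Copy the data so both tables hold identical rows.
    INSERT OVERWRITE TABLE events_parquet SELECT * FROM events_text;

    -- Operation1: row count
    SELECT COUNT(*) FROM events_parquet;
    -- Operation2: single row selection
    SELECT * FROM events_parquet WHERE id = 12345;
    -- Operation3: multi-row selection using a WHERE clause
    SELECT * FROM events_parquet WHERE col4 BETWEEN 100 AND 200;
    -- Operation4: same selection, but projecting only 4 columns
    SELECT id, col1, col2, col3 FROM events_parquet WHERE col4 BETWEEN 100 AND 200;
    -- Operation5: aggregation on a given column
    SELECT SUM(col3) FROM events_parquet;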

As the table shows, in almost every operation I ran against both tables Parquet lags behind in query execution time, with the exception of the row count operation.

I also ran the aforementioned operations against table C, but the results were along similar lines: the TextFile format was again the faster of the two.

Can someone please let me know what I am doing wrong?

Thanks!

EDIT

I added ORC to the list of storage formats and ran the tests again. The details follow.

---------------------------------------------------------------------------------------------
| Cumulative CPU [sec]                           | Text   | Parquet | ORC    | ORC + Snappy |
---------------------------------------------------------------------------------------------
| Row count                                      | 123.33 | 204.92  | 119.99 | 107.05       |
| Sum of a column                                | 127.85 | 255.2   | 120.48 | 98.27        |
| Average of a column                            | 128.79 | 211.73  | 165.5  | 135.45       |
| Select 4 columns from a range (WHERE clause)   | 72.48  | 136.4   | 96.63  | 82.05        |
---------------------------------------------------------------------------------------------
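For reference, the ORC variants can be declared in the usual Hive 0.13 way; a minimal sketch, again with placeholder table and column names, where Snappy is selected through a table property:

    -- ORC with default (ZLIB) compression.
    CREATE TABLE events_orc (id BIGINT, col1 STRING, col2 STRING, col3 DOUBLE, col4 INT)
    STORED AS ORC;

    -- ORC with Snappy compression selected via a table property.
    CREATE TABLE events_orc_snappy (id BIGINT, col1 STRING, col2 STRING, col3 DOUBLE, col4 INT)
    STORED AS ORC
    TBLPROPERTIES ("orc.compress"="SNAPPY");

    -- Load the same data into both ORC tables.
    INSERT OVERWRITE TABLE events_orc SELECT * FROM events_text;
    INSERT OVERWRITE TABLE events_orc_snappy SELECT * FROM events_text;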

Does that mean ORC is faster than Parquet? Or is there something I can do to make Parquet work better in terms of query response time and compression ratio?

Thanks!

Thilde answered 2/9, 2015 at 10:25 Comment(8)
Just out of curiosity: (1) did you try to pick a few columns instead of selecting all? Columnar storage is not that good at rebuilding "fat" rows out of column shreds. (2) did you consider ORC (with fast stripe elimination, vectorized reads, etc.) as an alternative format with better support in Hive? – Reentry
Samson, here is my response to your queries. #1. Yes, I did fetch a few columns from the table; the cumulative CPU was still higher for Parquet [you may check the results posted in the question]. #2. I worked on ORC after posting my question here. It takes far less space (652 MB) and less time than Parquet too. I will edit my question and post the complete results from the ORC table. – Thilde
Just out of curiosity, I was not able to find a good paper comparing Parquet and ORC neck and neck. Do you have any document that compares these two file formats by use case? Also, the table that I am using here is not that wide and has 12-15 columns. Is Parquet a good option here? – Thilde
The ugly truth is that Cloudera is pushing Parquet+Impala while HortonWorks is pushing ORC+Hive. Lots of marketing and politics involved... But there are some format-agnostic tools like Presto, e.g. zdnet.com/article/… (caveat: perf may change drastically with new versions and with config tuning) – Reentry
What is surprising to me is the time it is taking you. What hardware are you using? And always post code for benchmarks. – Pediatrics
@SamsonScharfrichter Interesting... I think the cool kids are using Spark+Parquet. – Pediatrics
There's not enough information to answer: in particular, there's no information on the schemas of the tables you used, nor on the statistics of the data. Where Parquet excels is tables with hundreds of columns, sparsely populated (storing nulls is basically free), where values are somewhat repeated. If you did your tests with randomly generated strings, for example, and with just a few columns, the test would not be significant. – Shafting
I also have a similar situation here. – Lifesaving

First, I would like to point out that it is virtually impossible to answer your question with the given details.

A few points:

  • measuring time in a distributed environment is not the way to determine whether something is slow (if you have many queries running and competing for resources, you are not measuring what you think you are measuring)

  • not providing the actual table definition and the queries running against those tables makes this problem impossible to reproduce

  • not providing the number of rows in the table and the cardinality of its individual fields is also not helping

In general, querying Parquet is much faster than querying text files because Parquet employs many techniques to make read operations faster. A few of these:

  • compression
  • run length encoding
  • dictionary encoding

Depending on the use case, some of these parameters can be tuned to match it exactly.
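For example, in Hive the Parquet compression codec, row-group size, and dictionary encoding can be influenced when the data is written. A minimal sketch using commonly documented Parquet settings (the table names are placeholders, and which settings are honoured can depend on the Hive/Parquet version, so verify them for your build):

    -- Compression codec used when writing Parquet files.
    SET parquet.compression=SNAPPY;
    -- Row-group ("block") size in bytes; larger groups favour full scans,
    -- smaller ones favour more selective reads.
    SET parquet.block.size=134217728;
    -- Dictionary encoding helps columns with many repeated values.
    SET parquet.enable.dictionary=true;

    -- Rewrite the table so the new settings apply to the stored files
    -- (events_parquet / events_text are hypothetical names).
    INSERT OVERWRITE TABLE events_parquet SELECT * FROM events_text;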

Purdah answered 16/11, 2020 at 14:8 Comment(0)
