Hive/Impala performance with string partition key vs Integer partition key
Asked Answered
R

2

5

Are numeric columns recommended for partition keys? Will there be any performance difference when we do a select query on numeric column partitions vs string column partitions?

Reconvert answered 29/8, 2018 at 16:24 Comment(1)
As an aside, having the partition key as an integer would help in performing arithmetic comparisons on it (for example, for partition purging) easier.Inter
L
4

No, there is no such recommendation. Consider this: The thing is that partition representation in Hive is a folder with a name like 'key=value' or it can be just 'value' but anyway it is string folder name. So it is being stored as string and is being cast during read/write. Partition key value is not packed inside data files and not compressed.

Due to the distributed/parallel nature of map-reduce and Impalla, you will never notice the difference in query processing performance. Also all data will be serialized to be passed between processing stages, then again deserialized and cast to some type, this can happen many times for the same query.

There are a lot of overhead created by distributed processing and serializing/deserializing data. Practically only the size of data matters. The smaller the table (it's files size) the faster it works. But you will not improve performance by restricting types.

Big string values used as partition keys can affect metadata DB performance, as well as the number of partitions being processed also can affect performance. Again the same: only the size of data matters here, not types.

1, 0 can be better than 'Yes', 'No' just because of size. And compression and parallelism can make this difference negligible in many cases.

Lashay answered 29/8, 2018 at 17:10 Comment(2)
when we writing data into these tables, I think partition is a kind of group by query. Group by performs will be faster on iintgers than strings. Any thoughts in this angle.Kesler
@Achyuth For example, GUID vs Int shows no difference in group by and joins. Because of high degree of parallelism and because it is being serialized/deserialized and transferred using http. And Map-reduce parallel execution adds so much random in the performance that you cannot repeat your experiment with the same result from run to run.Lashay
P
5

Well, it makes a difference if you look up the official Impala documentation.

Instead of elaborating, I will paste the section from the doc, as I think it states it quite well:

"Although it might be convenient to use STRING columns for partition keys, even when those columns contain numbers, for performance and scalability it is much better to use numeric columns as partition keys whenever practical. Although the underlying HDFS directory name might be the same in either case, the in-memory storage for the partition key columns is more compact, and computations are faster, if partition key columns such as YEAR, MONTH, DAY and so on are declared as INT, SMALLINT, and so on."

Reference: https://www.cloudera.com/documentation/enterprise/5-14-x/topics/impala_string.html

Purgative answered 21/11, 2018 at 7:2 Comment(2)
If you look into partition section of documentaion, it says, 'The data type of the partition columns does not have a significant effect on the storage required, because the values from those columns are not stored in the data files, rather they are represented as strings inside HDFS directory names.' link - docs.cloudera.com/documentation/enterprise/5-9-x/topics/…Couperin
Correct, it does not have much effect on the storage required on HDFS, but as stated, it has effect on the in-memory storage, and thereby the performance, if you have many partitions.Unscratched
L
4

No, there is no such recommendation. Consider this: The thing is that partition representation in Hive is a folder with a name like 'key=value' or it can be just 'value' but anyway it is string folder name. So it is being stored as string and is being cast during read/write. Partition key value is not packed inside data files and not compressed.

Due to the distributed/parallel nature of map-reduce and Impalla, you will never notice the difference in query processing performance. Also all data will be serialized to be passed between processing stages, then again deserialized and cast to some type, this can happen many times for the same query.

There are a lot of overhead created by distributed processing and serializing/deserializing data. Practically only the size of data matters. The smaller the table (it's files size) the faster it works. But you will not improve performance by restricting types.

Big string values used as partition keys can affect metadata DB performance, as well as the number of partitions being processed also can affect performance. Again the same: only the size of data matters here, not types.

1, 0 can be better than 'Yes', 'No' just because of size. And compression and parallelism can make this difference negligible in many cases.

Lashay answered 29/8, 2018 at 17:10 Comment(2)
when we writing data into these tables, I think partition is a kind of group by query. Group by performs will be faster on iintgers than strings. Any thoughts in this angle.Kesler
@Achyuth For example, GUID vs Int shows no difference in group by and joins. Because of high degree of parallelism and because it is being serialized/deserialized and transferred using http. And Map-reduce parallel execution adds so much random in the performance that you cannot repeat your experiment with the same result from run to run.Lashay

© 2022 - 2024 — McMap. All rights reserved.