Windowing function in Hive
Asked Answered
A

1

5

I am exploring windowing functions in Hive and I am able to understand the functionalities of all the UDFs. Although, I am not able to understand the partition by and order by that we use with the other functions. Following is the structure that is very similar to the query which I am planning to build.

SELECT a, RANK() OVER(partition by b order by c) as d from xyz; 

Just trying to understand the background process involved for both keywords.

Appreciate the help :)

Anselmi answered 29/4, 2019 at 18:34 Comment(0)
C
11

RANK() analytic function assigns a rank to each row in each partition in the dataset.

PARTITION BY clause determines how the rows to be distributed (between reducers if it is hive).

ORDER BY determines how the rows are being sorted in the partition.

First phase is distribute by, all rows in a dataset are distributed into partitions. In map-reduce each mapper groups rows according to the partition by and produces files for each partition. Mapper does initial sorting of partition parts according to the order by.

Second phase, all rows are sorted inside each partition. In map-reduce, each reducer gets partitions files (parts of partitions) produced by mappers and sorts rows in the whole partition (sort of partial results) according to the order by.

Third, rank function assigns rank to each row in a partition. Rank function is being initialized for each partition.

For the first row in the partition rank starts with 1. For each next row Rank=previous row rank+1. Rows with equal values (specified in the order by) given the same rank, if the two rows share the same rank, next row rank is not consecutive.

Different partitions can be processed in parallel on different reducers. Small partitions can be processed on the same reducer. Rank function re-initializes when it crossing the partition boundary and starts with rank=1 for each partition.

Example (rows are already partitioned and sorted inside partitions):

SELECT a, RANK() OVER(partition by b order by c) as d from xyz; 

a, b, c, d(rank)
----------------
1  1  1  1 --starts with 1
2  1  1  1 --the same c value, the same rank=1
3  1  2  3 --rank 2 is skipped because second row shares the same rank as first 

4  2  3  1 --New partition starts with 1
5  2  4  2
6  2  5  3

If you need consecutive ranks, use dense_rank function. dense_rank will produce rank=2 for the third row in the above dataset.

row_number function will assign a position number to each row in the partition starting with 1. Rows with equal values will receive different consecutive numbers.

SELECT a, ROW_NUMBER() OVER(partition by b order by c) as d from xyz; 

a, b, c, d(row_number)
----------------
1  1  1  1 --starts with 1
2  1  1  2 --the same c value, row number=2
3  1  2  3 --row position=3

4  2  3  1 --New partition starts with 1
5  2  4  2
6  2  5  3

Important note: For rows with the same values row_number or other such analytic function may have non-deterministic behavior and produce different numbers from run to run. First row in the above dataset may receive number 2 and second row may receive number 1 and vice-versa, because their order is not determined unless you will add one more column a to the order by clause. In this case all rows will always have the same row_number from run to run, their order values are different.

Clavichord answered 29/4, 2019 at 19:52 Comment(5)
thanks for the detailed explaination. Just a small add on question, what if the table is already partitioned and ordered by some keys? Does this apply or am i missing something?Anselmi
Table partitions in general have nothing in common with analytic function partition by clause. Same table partitioning may improve efficiency, mapper will process the same partition, producing less number of output files. Table is not ordered by definition in Codd's theory. Only order by guarantees the order due to parallelism.You can order to improve efficiency of packing ORC or Parquet files, improve internal indexes efficiency, improve file compression, order by during insert has nothing in common with order of rows returned by query.https://mcmap.net/q/2029423/-hive-order-by-not-visible-columnClavichord
@satishsilveri And of course table partitioning improves efficiency of filtering. But table data is being read in parallel and mappers and reducers are isolated from each other, not synchronized, emitting rows independently in parallel. This is why the result of query may be not ordered even if the data is ordered in the table files.Clavichord
Mapper does initial sorting of partitions parts - In the first phase is this also as per the field mentioned in order by?Fredricfredrick
@Fredricfredrick Yes. Partition parts processed by mapper are sorted according to order by.Clavichord

© 2022 - 2024 — McMap. All rights reserved.