DISTRIBUTE BY clause in HIVE
Asked Answered
F

2

13

I am not able to understand what this DISTRIBUTE BY clause does in Hive. I know the definition that says, if we have DISTRIBUTE BY (city), this would send each city in a different reducer but I am not getting the same. Let us consider the data as follows:

Say we have a table called data with columns username and amount:

+----------+--------+
| username | amount |
+----------+--------+
| user_1   | 25     |
+----------+--------+
| user_1   | 53     |
+----------+--------+
| user_1   | 28     |
+----------+--------+
| user_1   | 50     |
+----------+--------+
| user_2   | 20     |
+----------+--------+
| user_2   | 50     |
+----------+--------+
| user_2   | 10     |
+----------+--------+
| user_2   | 5      |
+----------+--------+

Now If I say -

SELECT username, SUM(amount) FROM data DISTRIBUTE BY (username)

Shouldn't this run 2 separate reducers? It is still running a single reducer and I don't know why. I thought this may have to do with clustering into buckets or partitioning but I tried everything, and it still runs a single reducer. Can anyone explain why?

Faithless answered 14/2, 2017 at 18:51 Comment(0)
T
18

The only thing DISTRIBUTE BY (city) says is that records with the same city will go to the same reducer. Nothing else.

Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy


A question by the OP:

Then what is the point of this DISTRIBUTE BY ? There's no guarantee that each (city) would go to a different reducer then why use it ?


For 2 reasons:

  1. In the beginning of hive DISTRIBUTE BY, SORT BY and CLUSTER BY where used to process data in a way that today is being done automatically (e.g. analytic functions https://oren.lederman.name/?p=32)

  2. You might want to stream you data through a script (Hive "Transform") and you want your script to process your data in certain groups and order. For that you can use DISTRIBUTE BY + SORT BY or CLUSTER BY. With DISTRIBUTE BY it is guaranteed that you'll have the whole group in the same reducer. With SORT BY that you'll get all the records of a group continuously.

Tessitura answered 14/2, 2017 at 19:11 Comment(8)
As in if I say there are four cities A,B,C and D , A should go a different reducer , B to different and so on ?Faithless
No. All A would go to 1 reducer. All B will go to 1 reducer You have no guarantee that these are going to be different reducers.Dowdell
All right so we can say at a time a reducer will contain only (1) kind of city ?Faithless
Also no. Every record that is being read by a mapper is being copied to one of the reducers, decided by Hash function(s) over the distribution value(s), in this case city and this take place only after the number of reducers is being decided.Dowdell
Seriously I don't get it. Then what is the point of this DISTRIBUTE BY ? There's no guarantee that each (city) would go to a different reducer then why use it ?Faithless
For 2 reasons (1) In the beginning of hive DISTRIBUTE BY, SORT BY and CLUSTER BY where used to process data in a way that today is being done automatically (e.g. analytic functions oren.lederman.name/?p=32) (2) You might want to stream you data through a script (Hive "Transform") and you want your script to process your data in certain groups and order. For that you can use DISTRIBUTE BY + SORT BY or CLUSTER BY. With DISTRIBUTE BY it is guaranteed that you'll have the whole group in the same reducer. With SORT BY that you'll get all the records of a group continuously.Dowdell
Finally got it. Thank you very much !Faithless
Hey @DavidדודוMarkovitz I used DISTRIBUTE BY clause as you said... But what if my tables don't have partitions? Then how can I make sure that my files goes into single reducer? I am trying to address the small files issue using insert overwrite of hive to merge them into bigger one.. For partitioned table distribute by seems to work fine but what when I don't have any partition columns?Paroicous
S
6

In addition to @Dudu's answer, the Distribute By only distributes the rows among the reducers which is determined from the input size.

The number of reducers to be used for a Hive job will be determined by this property hive.exec.reducers.bytes.per.reducer which is dependent on the input.

As of Hive 0.14, if the input is < 256MB, only one reducer (one reducer per 256MB of input) will be used unless the number of reducers is overridden by hive.exec.reducers.max or mapred.reduce.tasks properties.

Schoolmistress answered 14/2, 2017 at 19:54 Comment(3)
So If I want a different reducer for say each (city) , I am supposed to know the number of DISTINCT cities right ?Faithless
No. It is clear that the number of the reducers must be greater or equal to the number of cities in order to have each city on a different reducer, but nothing guarantees it. It is a hash function and theoretically you can have 10 cities and 100 reducers and still all the cities will be on a single reducer.Dowdell
Hiii @Schoolmistress both hive.exec.reducers.max and mapred.reduce.tasks doesnt seem to work. I want to set the no. of reducers to 1, so that all files go into one reducer and get merged as a single one. Since, my table does not have partitions I am not able to use DISTRIBUTE BY clause to send files of single partition into one reducer. Do you anyway I can set no. of reducers to 1??Paroicous

© 2022 - 2024 — McMap. All rights reserved.