DISTRIBUTE BY clause in HIVE

Asked 14/2, 2017 at 18:51 Answered 14/2, 2017 at 19:54

I am not able to understand what this DISTRIBUTE BY clause does in Hive. I know the definition that says, if we have DISTRIBUTE BY (city), this would send each city in a different reducer but I am not getting the same. Let us consider the data as follows:

Say we have a table called data with columns username and amount:

+----------+--------+
| username | amount |
+----------+--------+
| user_1   | 25     |
+----------+--------+
| user_1   | 53     |
+----------+--------+
| user_1   | 28     |
+----------+--------+
| user_1   | 50     |
+----------+--------+
| user_2   | 20     |
+----------+--------+
| user_2   | 50     |
+----------+--------+
| user_2   | 10     |
+----------+--------+
| user_2   | 5      |
+----------+--------+

Now If I say -

SELECT username, SUM(amount) FROM data DISTRIBUTE BY (username)

Shouldn't this run 2 separate reducers? It is still running a single reducer and I don't know why. I thought this may have to do with clustering into buckets or partitioning but I tried everything, and it still runs a single reducer. Can anyone explain why?

Faithless answered 14/2, 2017 at 18:51 Comment(0)

The only thing DISTRIBUTE BY (city) says is that records with the same city will go to the same reducer. Nothing else.

Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy

A question by the OP:

Then what is the point of this DISTRIBUTE BY ? There's no guarantee that each (city) would go to a different reducer then why use it ?

For 2 reasons:

In the beginning of hive DISTRIBUTE BY, SORT BY and CLUSTER BY where used to process data in a way that today is being done automatically (e.g. analytic functions https://oren.lederman.name/?p=32)
You might want to stream you data through a script (Hive "Transform") and you want your script to process your data in certain groups and order. For that you can use DISTRIBUTE BY + SORT BY or CLUSTER BY. With DISTRIBUTE BY it is guaranteed that you'll have the whole group in the same reducer. With SORT BY that you'll get all the records of a group continuously.

Tessitura answered 14/2, 2017 at 19:11 Comment(8)

As in if I say there are four cities A,B,C and D , A should go a different reducer , B to different and so on ? – Faithless 15/2, 2017 at 7:6

No. All A would go to 1 reducer. All B will go to 1 reducer You have no guarantee that these are going to be different reducers. – Dowdell 15/2, 2017 at 7:9

All right so we can say at a time a reducer will contain only (1) kind of city ? – Faithless 15/2, 2017 at 7:10

Also no. Every record that is being read by a mapper is being copied to one of the reducers, decided by Hash function(s) over the distribution value(s), in this case city and this take place only after the number of reducers is being decided. – Dowdell 15/2, 2017 at 7:15

Seriously I don't get it. Then what is the point of this DISTRIBUTE BY ? There's no guarantee that each (city) would go to a different reducer then why use it ? – Faithless 15/2, 2017 at 13:7

For 2 reasons (1) In the beginning of hive DISTRIBUTE BY, SORT BY and CLUSTER BY where used to process data in a way that today is being done automatically (e.g. analytic functions oren.lederman.name/?p=32) (2) You might want to stream you data through a script (Hive "Transform") and you want your script to process your data in certain groups and order. For that you can use DISTRIBUTE BY + SORT BY or CLUSTER BY. With DISTRIBUTE BY it is guaranteed that you'll have the whole group in the same reducer. With SORT BY that you'll get all the records of a group continuously. – Dowdell 15/2, 2017 at 16:9

Finally got it. Thank you very much ! – Faithless 17/2, 2017 at 17:1

Hey @DavidדודוMarkovitz I used DISTRIBUTE BY clause as you said... But what if my tables don't have partitions? Then how can I make sure that my files goes into single reducer? I am trying to address the small files issue using insert overwrite of hive to merge them into bigger one.. For partitioned table distribute by seems to work fine but what when I don't have any partition columns? – Paroicous 21/6, 2018 at 6:32

In addition to @Dudu's answer, the Distribute By only distributes the rows among the reducers which is determined from the input size.

The number of reducers to be used for a Hive job will be determined by this property hive.exec.reducers.bytes.per.reducer which is dependent on the input.

As of Hive 0.14, if the input is < 256MB, only one reducer (one reducer per 256MB of input) will be used unless the number of reducers is overridden by hive.exec.reducers.max or mapred.reduce.tasks properties.

Schoolmistress answered 14/2, 2017 at 19:54 Comment(3)

So If I want a different reducer for say each (city) , I am supposed to know the number of DISTINCT cities right ? – Faithless 15/2, 2017 at 7:8

No. It is clear that the number of the reducers must be greater or equal to the number of cities in order to have each city on a different reducer, but nothing guarantees it. It is a hash function and theoretically you can have 10 cities and 100 reducers and still all the cities will be on a single reducer. – Dowdell 15/2, 2017 at 7:48

Hiii @Schoolmistress both hive.exec.reducers.max and mapred.reduce.tasks doesnt seem to work. I want to set the no. of reducers to 1, so that all files go into one reducer and get merged as a single one. Since, my table does not have partitions I am not able to use DISTRIBUTE BY clause to send files of single partition into one reducer. Do you anyway I can set no. of reducers to 1?? – Paroicous 20/6, 2018 at 14:29

Recommended topics

Hot tags