I am not able to understand what the DISTRIBUTE BY clause does in Hive. I know the definition says that DISTRIBUTE BY (city) sends each city to a different reducer, but that is not the behavior I am seeing.
Say we have a table called data with columns username and amount:
+----------+--------+
| username | amount |
+----------+--------+
| user_1   |     25 |
| user_1   |     53 |
| user_1   |     28 |
| user_1   |     50 |
| user_2   |     20 |
| user_2   |     50 |
| user_2   |     10 |
| user_2   |      5 |
+----------+--------+
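For anyone who wants to reproduce this, the table can be set up roughly as follows. This is only a sketch: the column types (STRING and INT) are assumed here for illustration, and INSERT ... VALUES requires Hive 0.14 or later.

```sql
-- Assumed schema: the types are a guess for this sketch.
CREATE TABLE data (
  username STRING,
  amount   INT
);

-- Sample rows matching the table shown above.
INSERT INTO TABLE data VALUES
  ('user_1', 25),
  ('user_1', 53),
  ('user_1', 28),
  ('user_1', 50),
  ('user_2', 20),
  ('user_2', 50),
  ('user_2', 10),
  ('user_2', 5);
```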
Now if I run:
SELECT username, SUM(amount) FROM data DISTRIBUTE BY (username)
Shouldn't this run two separate reducers, one per username? It still runs a single reducer, and I don't know why. I thought this might have to do with clustering into buckets or partitioning, but I tried everything and it still runs a single reducer. Can anyone explain why?