How does Hive/Hadoop ensure that each mapper works on data that is local to it?
Two basic questions trouble me:

  • How can I be sure that each of the 32 files Hive uses to store my tables sits on its own unique machine?
  • If that happens, how can I be sure that when Hive creates 32 mappers, each of them will work on its local data? Does Hadoop/HDFS guarantee this magic, or does Hive, as a smart application, make sure that it happens?

Background: I have a hive cluster of 32 machines, and:

  • All my tables are created with "CLUSTERED BY(MY_KEY) INTO 32 BUCKETS"
  • I use hive.enforce.bucketing = true;
  • I verified that every table is indeed stored as 32 files under user/hive/warehouse
  • I'm using HDFS replication factor of 2
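For reference, a minimal HiveQL sketch of the setup described above (the table and column names are placeholders; only the bucketing clause and the `hive.enforce.bucketing` setting come from the question):

```sql
-- Hypothetical table and columns; the CLUSTERED BY clause matches the setup above.
SET hive.enforce.bucketing = true;

CREATE TABLE my_table (
  my_key  BIGINT,
  payload STRING
)
CLUSTERED BY (my_key) INTO 32 BUCKETS;

-- With hive.enforce.bucketing enabled, an INSERT ... SELECT hashes my_key
-- into 32 buckets, producing 32 files under the table's warehouse directory.
INSERT OVERWRITE TABLE my_table
SELECT my_key, payload FROM staging_table;
```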

Thanks!

Fausta answered 4/8, 2011 at 12:56 Comment(0)
  1. Data placement is determined by HDFS, which tries to balance bytes across machines. Because of replication, each file will be on two machines, which means you have two candidate machines for reading the data locally.
  2. HDFS knows where each file is stored, and Hadoop uses this information to place mappers on the same hosts where the data resides. You can look at your job's counters to see the "data local" and "rack local" map task counts. This is a feature of Hadoop that you don't need to worry about.
Customer answered 4/8, 2011 at 22:46 Comment(3)
By default HDFS replicates blocks three times (same node, plus two other nodes, preferably in another rack). – Rexferd
OK, thanks. In light of your answer I rephrased and asked a new question that better describes my problem: #6953883 – Fausta
@SpikeGronim would you be able to provide insight into this Hadoop question? Is it possible to restrict a MapReduce job from accessing remote data? – Rossner

Without joins, the usual Hadoop MapReduce mechanism for data locality is used (it is described in Spike's answer).
Specifically for Hive, I would mention map joins. It is possible to tell Hive the maximum table size for a map-only join. When one of the tables is small enough, Hive will replicate this table to all nodes using the distributed cache mechanism and ensure that the entire join happens locally to the data. There is a good explanation of the process here: http://www.facebook.com/note.php?note_id=470667928919
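A minimal sketch of requesting a map join (the table and column names are placeholders I made up; the `MAPJOIN` hint and the settings shown exist in Hive, but verify the exact names and defaults against your version's documentation):

```sql
-- small_dim and big_fact are hypothetical tables. small_dim must fit within
-- the configured size limit for Hive to ship it to every node via the
-- distributed cache, so the join runs map-side, local to the big table's data.
SET hive.mapjoin.smalltable.filesize = 25000000;  -- max small-table size in bytes
SET hive.auto.convert.join = true;                -- let Hive choose map joins automatically

-- Or force a map join explicitly with a hint:
SELECT /*+ MAPJOIN(small_dim) */ f.my_key, d.label
FROM big_fact f
JOIN small_dim d ON f.my_key = d.my_key;
```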

Horrocks answered 5/8, 2011 at 8:49 Comment(0)
