Two basic questions that trouble me:
- How can I be sure that each of the 32 files Hive uses to store my tables sits on its own machine? (See the fsck sketch below.)
- If that is the case, how can I be sure that when Hive creates 32 mappers, each of them works on its local data? Does Hadoop/HDFS guarantee this magic, or does Hive, as a smart application, make sure it happens?
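For reference, one way to inspect where the bucket files' blocks physically live would be something like the following (my_table is a placeholder, not my real table name):

    # Lists every bucket file of the table, each of its HDFS blocks,
    # and the datanodes holding the replicas of each block.
    hdfs fsck /user/hive/warehouse/my_table -files -blocks -locations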
Background: I have a Hive cluster of 32 machines, and:
- All my tables are created with
"CLUSTERED BY(MY_KEY) INTO 32 BUCKETS"
- I set
hive.enforce.bucketing = true;
- I verified that every table is indeed stored as 32 files under /user/hive/warehouse
- I'm using an HDFS replication factor of 2 (a minimal sketch of this setup follows below)
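For completeness, this is roughly what the setup looks like; table and column names here are placeholders rather than my real schema:

    -- Make Hive write exactly as many files as there are buckets.
    SET hive.enforce.bucketing = true;

    -- Placeholder table; every real table follows the same pattern.
    CREATE TABLE my_table (
      my_key INT,
      my_value STRING
    )
    CLUSTERED BY (my_key) INTO 32 BUCKETS;

    -- Buckets are only materialized when loading via INSERT ... SELECT:
    INSERT OVERWRITE TABLE my_table
    SELECT my_key, my_value FROM my_staging_table;

    -- Each bucket then appears as one file under the warehouse directory:
    dfs -ls /user/hive/warehouse/my_table;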
Thanks!