Confusion about distributed cache in Hadoop
What does the distributed cache actually mean? Does having a file in the distributed cache mean it is available on every datanode, so there is no internode communication for that data, or does it mean the file is in memory on every node? If not, how can I keep a file in memory for the entire job? Can this be done both for map-reduce and for a UDF?

(In particular, there is some comparatively small configuration data that I would like to keep in memory while a UDF runs as part of a Hive query.)

Thanks and regards, Dhruv Kapur.

Cravens answered 20/5, 2014 at 5:31 Comment(0)
DistributedCache is a facility provided by the MapReduce framework to cache files needed by applications. Once you cache a file for your job, the Hadoop framework makes it available on every data node (on the local file system, not in memory) where your map/reduce tasks run. You can then access the cached file as a local file in your Mapper or Reducer, read it, and populate a collection (e.g. an array or HashMap) in your code.

Refer to https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/filecache/DistributedCache.html
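Since the cached file shows up as an ordinary local file on each node, populating a lookup collection is plain file I/O. Here is a minimal, self-contained Java sketch of that read (the tab-separated format and the file name are just assumptions for illustration; in a real job you would register the file with `Job.addCacheFile(...)` and do this read once in the Mapper's `setup()` method):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class CacheFileDemo {

    // Load a tab-separated key/value file into a HashMap.
    // In a real Mapper/Reducer this would run once in setup(),
    // with the path pointing at the locally cached copy of the file.
    static Map<String, String> loadLookup(Path file) throws IOException {
        Map<String, String> lookup = new HashMap<>();
        for (String line : Files.readAllLines(file)) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                lookup.put(parts[0], parts[1]);
            }
        }
        return lookup;
    }

    public static void main(String[] args) throws IOException {
        // Simulate the cached file with a temp file (hypothetical contents).
        Path file = Files.createTempFile("config", ".txt");
        Files.write(file, Arrays.asList("us\tUnited States", "in\tIndia"));

        Map<String, String> lookup = loadLookup(file);
        System.out.println(lookup.get("in")); // prints "India"
        Files.delete(file);
    }
}
```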

Let me know if you still have questions.

You can read the cached file as a local file in your UDF code. After reading the file using the Java APIs, just populate an in-memory collection.

Refer to http://www.lichun.cc/blog/2013/06/use-a-lookup-hashmap-in-hive-script/
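For the UDF case, the usual pattern is lazy one-time initialization: populate the map on the first call to the UDF and reuse it for every subsequent row in the same task JVM. A plain-Java sketch of that pattern (a real Hive UDF would extend `UDF` or `GenericUDF` from hive-exec; the class name and file format here are hypothetical):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class LookupUdf {
    // Populated lazily on the first call, then reused for every row
    // processed by this task; the file is read only once.
    private Map<String, String> lookup;
    private final Path cacheFile;

    public LookupUdf(Path cacheFile) {
        this.cacheFile = cacheFile;
    }

    public String evaluate(String key) throws IOException {
        if (lookup == null) { // first call only
            lookup = new HashMap<>();
            for (String line : Files.readAllLines(cacheFile)) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        }
        return lookup.get(key);
    }
}
```

Because the null check guards the read, changing the file after the first call has no effect on the running task, which is the "populated only once" behavior asked about in the comments below.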

-Ashish

Somewise answered 20/5, 2014 at 8:9 Comment(9)
Hey, thanks for the reply. So I am more concerned about when and how we put something from the distributed cache into memory. In the case of Hive, I'll need access to this file from the distributed cache inside a UDF. How do I get it there? I should not be reading an HDFS file from inside a UDF, right?Cravens
I've modified the post. Just refer to the mentioned URL.Somewise
That's exactly what I am searching for. Thanks! I am still a little concerned about whether the map inside the UDF is populated only once. Is there some documentation of Hive supporting this, or some way I can verify this behavior?Cravens
The distributed cache concept works the same way for all of Hadoop: map-reduce, Pig, Hive, etc. In your mapper/reducer function, just populate a collection before iterating over the records to process. That way the map is populated only once per map/reduce task.Somewise
Are these files read-only? Can they be modified by any of the mappers or reducers?Solvent
These files can be treated like any local files (you need to check for write permission). But why do you want to modify them? What's your use case?Somewise
After adding a file, I am trying to access it in my UDF, but it fails with file not found. If I run the list command in the hive shell, I can see it under "mnt/tmp/<A long string>resource/", but the UDF expects it at "/user/hadoop". I am using EMR. Am I missing something?Dyspeptic
The first link is broken.Tamikotamil
Updated the post with the new link.Somewise
