Life of distributed cache in Hadoop

When files are transferred to nodes using the distributed cache mechanism in a Hadoop streaming job, does the system delete these files after the job is completed? If they are deleted, which I presume they are, is there a way to make the cache persist across multiple jobs? Does this work the same way on Amazon's Elastic MapReduce?
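
For context, the kind of streaming invocation I have in mind looks roughly like this (the jar path, HDFS paths and file names below are placeholders, not my actual job):

    # ship the scripts with the job and pull an HDFS file onto each node
    # via the distributed cache (symlinked as ./lookup.dat in the task dir)
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
      -input /user/me/input \
      -output /user/me/output \
      -mapper mapper.py \
      -reducer reducer.py \
      -file mapper.py \
      -file reducer.py \
      -cacheFile hdfs:///user/me/lookup.dat#lookup.dat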

Astrahan answered 19/12, 2010 at 15:57 Comment(0)

I was digging around in the source code, and it looks like files are deleted by TrackerDistributedCacheManager about once a minute, once their reference count drops to zero. The TaskRunner explicitly releases all of its files at the end of a task. Maybe you could edit TaskRunner not to do this, and control the cache yourself through more explicit means?

Slaton answered 20/12, 2010 at 15:18 Comment(1)
That's a huge help. I think there might be other ways to get files loaded onto the nodes, which I will explore. Distributed cache was just the method I was familiar with. Thanks for the code ref, that's amazingly helpful. – Astrahan

I cross-posted this question on the AWS forum and got a good recommendation: use hadoop fs -get to transfer files in a way that persists across jobs.
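
For anyone who finds this later: the idea is to pull the file from HDFS onto the node's local disk once, for example in a one-time setup step, and then have every subsequent job read the local copy instead of shipping it through the distributed cache each time. A rough sketch, with placeholder paths:

    # hypothetical one-time fetch per node, run before the jobs that need the file
    mkdir -p /mnt/jobdata
    hadoop fs -get hdfs:///user/me/lookup.dat /mnt/jobdata/lookup.dat
    # later jobs read /mnt/jobdata/lookup.dat directly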

Astrahan answered 21/12, 2010 at 21:31 Comment(0)
