Are getCacheFiles() and getLocalCacheFiles() the same?

As getLocalCacheFiles() is deprecated, I'm trying to find an alternative. getCacheFiles() seems to be one, but I doubt whether they are the same.

When you call addCacheFile(), the file in HDFS is downloaded to every node. With getLocalCacheFiles() you get the localized file path and can read the file from the local file system. However, getCacheFiles() returns the URI of the file in HDFS. If you read the file through this URI, I suspect you are still reading from HDFS rather than from the local file system.

The above is my understanding; I don't know whether it's correct. If it is, what's the alternative to getLocalCacheFiles()? And why did Hadoop deprecate it in the first place?

Wish answered 21/10, 2014 at 17:44 Comment(0)

It's open source, so you can always use git blame to find the commit that introduced the @Deprecated annotation: commit 735b50e8bd23f7fbeff3a08cf8f3fff8cbff7449, which is for MAPREDUCE-4493. At the tail of the JIRA you'll find this discussion:

Omkar Vinit Joshi added a comment - 13/Jul/13 00:18
Robert Joseph Evans if we are deprecating getLocalCacheFiles and getCacheFiles in jobContext() then how the user is going to get local cached files in map task? YARN-916 is the related issue.. Thanks.

Robert Joseph Evans added a comment - 19/Jul/13 15:27
Omkar Vinit Joshi By opening the symbolic link in the current working directory. Prior to YARN the default behavior was to not create symlinks in the current working directory pointing to the items in the distributed cache. If you wanted links you had to specifically turn that option on and provide the name of the symlink you wanted. The only way to get to files without symlinks was to call getLocalCacheFiles and getCacheFiles. In YARN all files will have a symlink created. The name of the file/directory will be the name of the symlink. However, it is possible to have a name collision where I wanted hdfs://foo/bar.zip and hdfs://bar/bar.zip. In 1.0 both of these would have been downloaded and accessible through the deprecated APIs, but in YARN a warning will be output and only one of them will be downloaded. Also because of the way these APIs were written the mapper code may not know that only one of them was downloaded and will not be able to find the missing one and blow up. That is why I deprecated them in favor of nudging people to always use the symlinks so the behavior is always consistent.

Omkar Vinit Joshi added a comment - 19/Jul/13 16:56
Robert Joseph Evans sounds good.. however by this we will be putting limitation based on file name ..but that sounds reasonable considering the fact that this will stop potential bugs in map code and users can definitely version them to avoid it... Thanks...
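The name collision described in that thread can be caught on the client side before the job is submitted. Here is a minimal sketch, assuming the symlink name is the URI fragment when one is given (e.g. `...bar.zip#bar-v2.zip`) and the last path segment otherwise. The class `CacheLinkNames` and its methods are hypothetical helpers for illustration, not part of Hadoop:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a Hadoop class): predicts distributed-cache symlink names. */
final class CacheLinkNames {
    private CacheLinkNames() {}

    /**
     * Assumed naming rule: the URI fragment if present,
     * otherwise the last segment of the URI path.
     */
    static String linkName(URI cacheUri) {
        String fragment = cacheUri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        String path = cacheUri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Returns every link name claimed by more than one URI, with the URIs that claim it. */
    static Map<String, List<URI>> findCollisions(List<URI> cacheUris) {
        Map<String, List<URI>> byName = new LinkedHashMap<>();
        for (URI uri : cacheUris) {
            byName.computeIfAbsent(linkName(uri), k -> new ArrayList<>()).add(uri);
        }
        // Keep only the names that are shared by two or more URIs.
        byName.values().removeIf(uris -> uris.size() < 2);
        return byName;
    }
}
```

With the two URIs from the JIRA comment, `findCollisions` reports `bar.zip` as contested. Giving one of them a distinct fragment (the "versioning" Omkar mentions) should remove the conflict under the assumption above.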

So you're supposed to just open the file by its symlink name in the task's current working directory; it will be there. There is no dedicated API.
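Since code was requested in the comments, here is a rough sketch of what "just open the file" can look like inside a task. It uses only plain Java file I/O. The class `LocalizedCacheFile` and the link name `lookup.txt` are made up for illustration, and the sketch assumes the job was submitted on YARN with something like `job.addCacheFile(...)` for a file whose name (or URI fragment) is `lookup.txt`:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

/** Hypothetical helper (not a Hadoop class): reads a cached file through its symlink. */
final class LocalizedCacheFile {
    private LocalizedCacheFile() {}

    /** Reads the file behind the symlink named linkName inside workingDir. */
    static List<String> readLines(Path workingDir, String linkName) {
        Path link = workingDir.resolve(linkName);
        // Files.exists follows symlinks, so a dangling or missing link fails here.
        if (!Files.exists(link)) {
            throw new IllegalStateException("No localized cache file at " + link.toAbsolutePath());
        }
        try {
            return Files.readAllLines(link, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience overload: a relative path resolves against the task's working directory. */
    static List<String> readLines(String linkName) {
        return readLines(Paths.get(""), linkName);
    }
}
```

In a mapper's `setup()` you would then call something like `LocalizedCacheFile.readLines("lookup.txt")`. The read goes to the local file system, not to HDFS, because the symlink points at the localized copy.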

Exieexigency answered 21/10, 2014 at 18:17 Comment(2)
I'm using Hadoop 1.2.1. Can you please explain the difference between getCacheFiles and getLocalCacheFiles? - Abebi
Thanks for the answer. Could you also update it with code snippets for the alternatives? That would help us see what choices we have for accessing the files. - Eichler
