Hadoop MapReduce: provide nested directories as job input
I'm working on a job that processes a nested directory structure, containing files on multiple levels:

one/
├── three/
│   └── four/
│       ├── baz.txt
│       ├── bleh.txt
│       └── foo.txt
└── two/
    ├── bar.txt
    └── gaa.txt

When I add one/ as an input path, no files are processed, since none are immediately available at the root level.

I read about job.addInputPathRecursively(..), but this seems to have been deprecated in more recent releases (I'm using Hadoop 1.0.2). I wrote some code to walk the folders and add each directory with job.addInputPath(dir), which worked until the job crashed while trying to process a directory as an input file, e.g. calling fs.open(split.getPath()) when split.getPath() is a directory (this happens inside LineRecordReader.java).

I'm trying to convince myself there has to be a simpler way to provide a job with a nested directory structure. Any ideas?

EDIT - apparently there's an open bug on this.

Madai answered 18/4, 2012 at 13:44 Comment(5)
Is it so difficult to use FileSystem#listStatus() and add them recursively? – Heliostat
I'm solving it in a similar way – I wrote recursive code that traverses subdirectories and adds all the files to the input paths. – Caracul
@ThomasJungblut that's basically my current approach. I just find it odd that this functionality isn't built in. Another issue I'm having is that Hadoop crashes when it accesses a subfolder that contains no files, just other folders (like one and one/three in my example). So basically I need logic that adds folders recursively unless they contain only other folders instead of files, while still walking their contents to add the nested files. Seems like a lot of trouble just to set up a job. – Madai
You could write a PathFilter that only accepts files, and then use the FileInputFormat.setInputPathFilter method – hadoop.apache.org/common/docs/current/api/org/apache/hadoop/… – Viceregent
Possible duplicate of FileStatus use to recurse directory – Kinky
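The files-only recursive walk discussed in the comments above can be sketched as follows. This uses java.nio.file as a stand-in for HDFS's FileSystem#listStatus just to illustrate the traversal logic (the class and method names are hypothetical; against HDFS you would check FileStatus#isDirectory instead of Files.isDirectory):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class FileWalk {
    // Recurse into directories and collect only plain files, so a
    // directory is never handed to the record reader as an input split.
    public static List<Path> collectFiles(Path root) {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(root)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    files.addAll(collectFiles(entry)); // descend, never add the dir itself
                } else {
                    files.add(entry);                  // only real files become inputs
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }
}
```

Each collected path would then be passed to job.addInputPath(..); empty intermediate folders (like one/three above) are simply descended through and contribute nothing.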

I didn't find any documentation on this, but */* works. So it's -input 'path/*/*'.

Henricks answered 13/8, 2012 at 6:57 Comment(3)
Are you sure this isn't being expanded by bash (or your shell) and launching tons of Hadoop instances? – Grenoble
I have single quotes around them. – Henricks
Running ps aux would help clear up the issue mentioned by @Grenoble. – Kame

In the new API (org.apache.hadoop.mapreduce), FileInputFormat can be told to descend into subdirectories:

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

FileInputFormat.setInputDirRecursive(job, true);

No thanks, just call me LeiFeng!

Brout answered 31/12, 2014 at 3:43 Comment(0)

I find that recursively going through data can be dangerous, since there may be lingering log files from a distcp or something similar. Let me propose an alternative:

Do the recursive walk on the command line, and then pass the paths as a space-delimited parameter to your MapReduce program, grabbing the list from argv:

$ hadoop jar blah.jar "`hadoop fs -lsr recursivepath | awk '{print $8}' | grep '/data.*\.txt' | tr '\n' ' '`"

Sorry for the long bash one-liner, but it gets the job done. You could wrap it in a bash script to break things out into variables.

I personally like this pass-in-the-filepaths approach to writing my MapReduce jobs: the code itself doesn't have hardcoded paths, and it's relatively easy to set it up to run against more complex lists of files.

Dustup answered 18/4, 2012 at 17:31 Comment(2)
Thanks for this. Do you know if there is any reason to do it this way vs. FileInputFormat.addInputPaths("comma separated list from the above bash")? – Tedmann
Interesting, any reason why? I'm quite new to Hadoop but ran into this -lsr problem already. – Tedmann

I don't know if this is still relevant, but at least in Hadoop 2.4.0 you can set the property mapreduce.input.fileinputformat.input.dir.recursive to true and it will solve your problem.
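For example, the property can be passed from the command line, assuming the job's driver goes through ToolRunner/GenericOptionsParser so that -D options are picked up (the jar name, driver class, and paths here are placeholders):

```shell
hadoop jar myjob.jar MyDriver \
    -D mapreduce.input.fileinputformat.input.dir.recursive=true \
    /input/one /output
```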

Myel answered 4/12, 2014 at 12:46 Comment(0)

Just use FileInputFormat.addInputPath() with a file pattern. I'm writing my first Hadoop program for graph analysis, where the input comes from different directories in .gz format... it worked for me!

Piker answered 27/4, 2012 at 21:49 Comment(1)
Using a name pattern is one way to avoid the nested directory problem. – Hunan

© 2022 - 2024 — McMap. All rights reserved.