Hadoop MapReduce: provide nested directories as job input
I'm working on a job that processes a nested directory structure, containing files on multiple levels:

one/
├── three/
│   └── four/
│       ├── baz.txt
│       ├── bleh.txt
│       └── foo.txt
└── two/
    ├── bar.txt
    └── gaa.txt

When I add one/ as an input path, no files are processed, since none are immediately available at the root level.

I read about job.addInputPathRecursively(..), but this seems to have been deprecated in more recent releases (I'm using Hadoop 1.0.2). I wrote some code to walk the folders and add each directory with job.addInputPath(dir), which worked until the job crashed while trying to process a directory as an input file, e.g. calling fs.open(split.getPath()) when split.getPath() is a directory (this happens inside LineRecordReader.java).

I'm trying to convince myself there has to be a simpler way to provide a job with a nested directory structure. Any ideas?

EDIT - apparently there's an open bug on this.

Madai answered 18/4, 2012 at 13:44 Comment(5)
Is it so difficult to use FileSystem#listStatus() and add them recursively? – Heliostat
I'm solving it in a similar way – I wrote recursive code that traverses subdirectories and adds all the files to the input paths. – Caracul
@ThomasJungblut that's basically my current approach. I just find it odd that this functionality isn't built in. Another issue I'm having is that Hadoop crashes when it accesses a subfolder that contains no files, just other folders (like one and one/three in my example). So basically I need logic that adds folders recursively unless they contain only other folders instead of files, while still walking their contents to add the nested files. Seems like a lot of trouble just to set up a job. – Madai
You could write a PathFilter that only accepts files, and then use the FileInputFormat.setInputPathFilter method – hadoop.apache.org/common/docs/current/api/org/apache/hadoop/… – Viceregent
Possible duplicate of FileStatus use to recurse directory – Kinky
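The files-only recursive walk discussed in the comments above can be sketched as follows. This uses java.nio.file as a stand-in for HDFS's FileSystem#listStatus just to illustrate the traversal logic (the class and method names are hypothetical; against HDFS you would check FileStatus#isDirectory instead of Files.isDirectory):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class FileWalk {
    // Recurse into directories and collect only plain files, so a
    // directory is never handed to the record reader as an input split.
    public static List<Path> collectFiles(Path root) {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(root)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    files.addAll(collectFiles(entry)); // descend, never add the dir itself
                } else {
                    files.add(entry);                  // only real files become inputs
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }
}
```

Each collected path would then be passed to job.addInputPath(..); empty intermediate folders (like one/three above) are simply descended through and contribute nothing.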

I didn't find any documentation on this, but */* works. So it's -input 'path/*/*'.

Henricks answered 13/8, 2012 at 6:57 Comment(3)
Are you sure this isn't being expanded by bash (or your shell) and launching tons of Hadoop instances? – Grenoble
I have single quotes around them. – Henricks
Running ps aux would help clear up the issue mentioned by @Grenoble. – Kame

In the new API (org.apache.hadoop.mapreduce), FileInputFormat can be told to descend into subdirectories:

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

FileInputFormat.setInputDirRecursive(job, true);

No thanks, just call me LeiFeng!

Brout answered 31/12, 2014 at 3:43 Comment(0)

I find that recursively going through data can be dangerous, since there may be lingering log files from a distcp or something similar. Let me propose an alternative:

Do the recursive walk on the command line, and then pass the paths as a space-delimited parameter to your MapReduce program, grabbing the list from argv:

$ hadoop jar blah.jar "`hadoop fs -lsr recursivepath | awk '{print $8}' | grep '/data.*\.txt' | tr '\n' ' '`"

Sorry for the long bash one-liner, but it gets the job done. You could wrap it in a bash script to break things out into variables.

I personally like this pass-in-the-filepaths approach to writing my MapReduce jobs: the code itself doesn't have hardcoded paths, and it's relatively easy to set it up to run against more complex lists of files.

Dustup answered 18/4, 2012 at 17:31 Comment(2)
Thanks for this. Do you know if there is any reason to do it this way vs. FileInputFormat.addInputPaths("comma separated list from the above bash")? – Tedmann
Interesting, any reason why? I'm quite new to Hadoop but ran into this -lsr problem already. – Tedmann

I don't know if this is still relevant, but at least in Hadoop 2.4.0 you can set the property mapreduce.input.fileinputformat.input.dir.recursive to true and it will solve your problem.
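For example, the property can be passed from the command line, assuming the job's driver goes through ToolRunner/GenericOptionsParser so that -D options are picked up (the jar name, driver class, and paths here are placeholders):

```shell
hadoop jar myjob.jar MyDriver \
    -D mapreduce.input.fileinputformat.input.dir.recursive=true \
    /input/one /output
```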

Myel answered 4/12, 2014 at 12:46 Comment(0)

Just use FileInputFormat.addInputPath() with a file pattern. I'm writing my first Hadoop program for graph analysis, where the input comes from different directories in .gz format... it worked for me!

Piker answered 27/4, 2012 at 21:49 Comment(1)
Using a name pattern is one way to avoid the nested directory problem. – Hunan

© 2022 - 2024 — McMap. All rights reserved.