I'm working on a job that processes a nested directory structure, containing files on multiple levels:
one/
├── three/
│   └── four/
│       ├── baz.txt
│       ├── bleh.txt
│       └── foo.txt
└── two/
    ├── bar.txt
    └── gaa.txt
When I add one/ as an input path, no files are processed, since none are immediately available at the root level.
I read about job.addInputPathRecursively(..), but this seems to have been deprecated in the more recent releases (I'm using Hadoop 1.0.2). I've written some code that walks the folders and adds each dir with job.addInputPath(dir), which worked until the job crashed while trying to process a directory as an input file, e.g. calling fs.open(split.getPath()) when split.getPath() is a directory (this happens inside LineRecordReader.java).
I'm trying to convince myself there has to be a simpler way to provide a job with a nested directory structure. Any ideas?
EDIT - apparently there's an open bug on this.
Can't you just list the directories with FileSystem#listStatus() and add them recursively? – Heliostatone
That breaks on folders that contain only other folders (one and one/three in my example). So basically I need to implement logic that will add folders recursively unless they contain only other folders instead of files (I'd still have to walk their content to add nested files). Seems like a lot of trouble just to set up a job. – Madai