Our workflow uses an Amazon Elastic MapReduce (EMR) cluster to run a series of Pig jobs that turn a large amount of data into aggregated reports. Unfortunately, the input data is not always consistent: the pipeline can be handed no input files at all, or 0-byte files, and some stages of the pipeline can themselves produce empty output that later stages then try to read.
During a LOAD statement, Pig fails spectacularly if it either finds no input files or finds that any of the input files is 0 bytes.
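For illustration, a single stage has roughly the shape sketched below. The bucket name, paths, and schema are hypothetical placeholders, not our real ones; the point is only that the LOAD glob may match nothing, or may match files that are 0 bytes.

```
-- Hypothetical stage: bucket, paths, and schema are made up for illustration.
-- The glob may match no files at all, or may match 0-byte files.
raw = LOAD 's3://example-bucket/input/part-*' USING PigStorage('\t')
      AS (id:chararray, value:long);

-- Aggregate per id and hand the result to the next stage.
grouped = GROUP raw BY id;
totals  = FOREACH grouped GENERATE group, SUM(raw.value);

-- This output can itself end up empty, which then breaks the next stage's LOAD.
STORE totals INTO 's3://example-bucket/stage1-output';
```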
Is there any good way to work around this (hopefully within the Pig configuration or script or the Hadoop cluster configuration, without writing a custom loader...)?
(Since we're using Elastic MapReduce, we're stuck with Pig 0.6.0 and Hadoop 0.20.)