Pig is doing the right thing here and is unioning the data sets. Being all in one file isn't what makes one data set in Hadoop... one data set in Hadoop is usually a folder. Since the union doesn't need to run a reduce here, it's not going to.
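For concreteness, here's a minimal sketch of that map-only case (the file names and single-field schema are assumptions for illustration, not from the question):

Message_1 = LOAD 'msg1.txt' AS (line:chararray);
Message_2 = LOAD 'msg2.txt' AS (line:chararray);
A = UNION Message_1,Message_2;
-- map-only job: 'union_out' is a folder with one part-m-NNNNN file per map task
STORE A INTO 'union_out';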
You need to fool Pig into running both a Map AND a Reduce. The way I usually do this is:
set default_parallel 1
...
A = UNION Message_1,Message_2,Message_3,Message_4;
B = GROUP A BY 1; -- group ALL of the records together
C = FOREACH B GENERATE FLATTEN(A);
...
The GROUP BY groups all of the records together, and then the FLATTEN explodes that list back out.
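To make the intermediate shape concrete, here's a sketch of the schemas involved (again assuming A has a single chararray field, which is my assumption):

B = GROUP A BY 1;
DESCRIBE B;   -- B: {group: int, A: {(line: chararray)}} -- one constant key plus a bag of every row
C = FOREACH B GENERATE FLATTEN(A);
DESCRIBE C;   -- C: {A::line: chararray} -- the bag is unnested back into individual rows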
One thing to note here is that this isn't much different from doing:
$ hadoop fs -cat msg1.txt msg2.txt msg3.txt msg4.txt | hadoop fs -put - union.txt
(this is concatenating all of the text, and then writing it back out to HDFS as a new file)
This isn't parallel at all, but neither is funneling all of your data through one reducer.
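Along the same lines, if routing the data through the local machine is acceptable, hadoop fs -getmerge will concatenate the part files of an output folder for you (the paths here are made up for illustration):

$ hadoop fs -getmerge union_out merged.txt
$ hadoop fs -put merged.txt union.txt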
You could use GROUP ... BY 1 PARALLEL 1 instead of setting the global default_parallel. – Burthen
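A sketch of what that comment suggests: the PARALLEL clause scopes the single reducer to just this GROUP, leaving any other jobs in the script at their default parallelism:

B = GROUP A BY 1 PARALLEL 1;  -- only this reduce job is forced down to one reducer
C = FOREACH B GENERATE FLATTEN(A);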