Storing results of UNION in PIG in a single file
I have a Pig script which produces four results, and I want to store all of them in a single file. I tried using UNION, but when I do, I get four files: part-m-00000, part-m-00001, part-m-00002, part-m-00003. Can't I get a single file?

Here is the Pig script:

A = UNION Message_1, Message_2, Message_3, Message_4;
STORE A INTO 'AA';

Inside the AA folder I get the 4 files mentioned above. Can't I get a single file with all the entries in it?

Aspirant answered 8/6, 2012 at 19:20 Comment(0)

Pig is doing the right thing here: it is unioning the data sets. In Hadoop, a data set is usually a folder rather than a single file, so four part files in one folder are still one data set. The union runs as a map-only job, and since Pig doesn't need a reduce phase here, it doesn't run one.

You need to fool Pig into running a map AND a reduce. The way I usually do this is:

set default_parallel 1

...
A = UNION Message_1,Message_2,Message_3,Message_4;
B = GROUP A BY 1; -- group ALL of the records together
C = FOREACH B GENERATE FLATTEN(A);
...

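The GROUP BY groups all of the records together, which forces a reduce phase, and then the FLATTEN explodes that list back out. With default_parallel set to 1, that reduce phase runs on a single reducer, so the output is a single file.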
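For completeness, here is a minimal end-to-end sketch of the same trick. It assumes the four relations are loaded from the text files msg1.txt through msg4.txt, and the single-column schema (line:chararray) is a hypothetical one chosen for illustration. It also puts PARALLEL 1 on the GROUP itself rather than setting the global default_parallel, so only this one operator is limited to a single reducer.

```pig
-- Hypothetical inputs and schema; substitute your own relations.
Message_1 = LOAD 'msg1.txt' AS (line:chararray);
Message_2 = LOAD 'msg2.txt' AS (line:chararray);
Message_3 = LOAD 'msg3.txt' AS (line:chararray);
Message_4 = LOAD 'msg4.txt' AS (line:chararray);

-- On its own, the union is a map-only job and writes one part-m file per mapper.
A = UNION Message_1, Message_2, Message_3, Message_4;

-- Grouping on a constant puts every record in one group and forces a reduce;
-- PARALLEL 1 asks for a single reducer for this operator only.
B = GROUP A BY 1 PARALLEL 1;

-- Unpack the grouped bag back into the original records.
C = FOREACH B GENERATE FLATTEN(A);

STORE C INTO 'AA';
```

With a single reducer, the AA folder should end up with one data file (typically named part-r-00000) instead of four part-m files. Note that the reduce step does not promise to keep the records in their original order.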


One thing to note here is that this isn't much different from doing:

$ hadoop fs -cat msg1.txt msg2.txt msg3.txt msg4.txt | hadoop fs -put - union.txt

(this is concatenating all of the text, and then writing it back out to HDFS as a new file)

This isn't parallel at all, but neither is funneling all of your data through one reducer.

Rhiamon answered 9/6, 2012 at 13:26 Comment(2)
You can also write GROUP ... BY 1 PARALLEL 1 instead of setting the global default_parallel. - Burthen
After looking for so long, I think this is the best workaround. - Periostitis

Have you tried setting the default_parallel property?

grunt> set default_parallel 1
grunt> A = UNION Message_1,Message_2,Message_3,Message_4;
Stoss answered 8/6, 2012 at 22:27 Comment(2)
No, it still gives me 4 files. I used just the line you suggested (grunt> set default_parallel 1). Should I do something more with the Pig properties? - Aspirant
It is a map-only job, so setting default_parallel won't work, since it affects only the reduce phase. If this is part of a larger job, you can try to write the script so that the last job is a reduce job and set default_parallel to 1 before that job; then it will work. - Collegian
