How do I store gzipped files using PigStorage in Apache Pig?
C

3

10

Apache Pig v0.7 can read gzipped files with no extra effort on my part, e.g.:

MyData = LOAD '/tmp/data.csv.gz' USING PigStorage(',') AS (timestamp, user, url);

I can process that data and output it to disk okay:

PerUser = GROUP MyData BY user;
UserCount = FOREACH PerUser GENERATE group AS user, COUNT(MyData) AS count;
STORE UserCount INTO '/tmp/usercount' USING PigStorage(',');

But the output file isn't compressed:

/tmp/usercount/part-r-00000

Is there a way of telling the STORE command to output content in gzip format? Note that ideally I'd like an answer applicable for Pig 0.6 as I wish to use Amazon Elastic MapReduce; but if there's a solution for any version of Pig I'd like to hear it.

Czarism answered 11/2, 2011 at 12:12 Comment(0)
I
14

There are two ways:

  1. As mentioned in another answer, give the output directory in the STORE statement a .gz extension, e.g. usercount.gz:

    STORE UserCount INTO '/tmp/usercount.gz' USING PigStorage(',');

  2. Set the compression method in your script:

    set output.compression.enabled true;
    set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

Illtimed answered 27/11, 2012 at 10:48 Comment(0)
W
10

For Pig r0.8.0 the answer is as simple as giving your output path an extension of ".gz" (or ".bz" should you prefer bzip).

The last line of your code should be amended to read:

STORE UserCount INTO '/tmp/usercount.gz' USING PigStorage(',');

Per your example, your output file would then be found as

/tmp/usercount.gz/part-r-00000.gz

For more information, see: https://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#PigStorage
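
One way to sanity-check the result is to load the compressed output back in, relying on the behaviour described in the question (Pig reading gzipped input with no extra effort). This is only a sketch: the alias `Stored` and the field name `total` are illustrative names that do not appear in the original script, and it assumes the STORE above has already written to '/tmp/usercount.gz'.

```pig
-- Hypothetical check: read the gzipped part files back from the output directory.
-- 'Stored' and 'total' are illustrative names, not part of the original script.
Stored = LOAD '/tmp/usercount.gz' USING PigStorage(',') AS (user, total);
DUMP Stored;
```

If the tuples print as plain user/count pairs, the part files were written and read back as gzip.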

Waldheim answered 24/2, 2011 at 5:34 Comment(2)
Great answer. Unfortunately Amazon Elastic Map-Reduce only supports Pig v0.6. - Czarism
FYI: EMR is running Pig version 0.9.2 by default currently, so this should work now. - Overpraise
T
3

According to the Pig documentation for PigStorage, there are two ways to do this:

Specifying the compression format using the 'STORE' statement

STORE UserCount INTO '/tmp/usercount.gz' USING PigStorage(',');
STORE UserCount INTO '/tmp/usercount.bz2' USING PigStorage(',');
STORE UserCount INTO '/tmp/usercount.lzo' USING PigStorage(',');

As the statements above show, the file extension selects the codec. Pig supports three compression formats: gzip, bzip2, and LZO. LZO support has to be installed separately. See here for more information about LZO.

Specifying compression via job properties

Set the properties output.compression.enabled and output.compression.codec in your Pig script. First enable compression:

set output.compression.enabled true;

and then pick one of the following codecs (use only one of these lines):

set output.compression.codec com.hadoop.compression.lzo.LzopCodec;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
set output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;
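
Put together with the script from the question, the property-based approach might look like the sketch below. It assumes the Pig version in use honours these two properties (the thread does not say which versions do) and picks gzip as the single codec. The paths and aliases are taken from the question.

```pig
-- Enable compressed output for the STORE statements in this script.
set output.compression.enabled true;
-- Choose exactly one codec; gzip is used here.
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

MyData = LOAD '/tmp/data.csv.gz' USING PigStorage(',') AS (timestamp, user, url);
PerUser = GROUP MyData BY user;
UserCount = FOREACH PerUser GENERATE group AS user, COUNT(MyData) AS count;

-- The output path needs no .gz extension here; the properties above request compression.
STORE UserCount INTO '/tmp/usercount' USING PigStorage(',');
```
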
Tashia answered 12/5, 2015 at 5:5 Comment(0)
