How to store grouped records into multiple files with Pig?

After loading and grouping records, how can I store those grouped records into several files, one per group (=userid)?

records = LOAD 'input' AS (userid:int, ...);
grouped_records = GROUP records BY userid;

I'm using Apache Pig version 0.8.1-cdh3u3 (rexported)

Fertility answered 16/2, 2012 at 15:52 Comment(1)
Hmm, it seems MultiStorage in Piggybank could be what I am looking for (?): svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/… - Fertility

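REGISTER piggybank.jar; -- assumed jar location; adjust to your installation
A = LOAD 'mydata' USING PigStorage() AS (a, b, c);
-- MultiStorage is part of Piggybank, so it is referenced by its fully qualified class name
STORE A INTO '/my/home/output' USING org.apache.pig.piggybank.storage.MultiStorage('/my/home/output', '0', 'bz2', '\\t');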

Parameters:

  1. parentPathStr - parent output directory path
  2. splitFieldIndex - zero-based index of the key field used to split the records
  3. compression - 'bz2', 'bz', 'gz' or 'none'
  4. fieldDel - field delimiter for the output records
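
To make the four parameters concrete, here is a small variation on the snippet above that splits on the second field instead of the first. It is only a sketch: the jar location in REGISTER and the choice of 'gz' with a comma delimiter are assumptions made for illustration.

```pig
-- Sketch only: the jar path is an assumption; point it at your Piggybank jar.
REGISTER piggybank.jar;

A = LOAD 'mydata' USING PigStorage() AS (a, b, c);

-- Arguments, in order:
--   parentPathStr   = '/my/home/output'  (same path as in STORE ... INTO)
--   splitFieldIndex = '1'                (split on field b, the second field)
--   compression     = 'gz'
--   fieldDel        = ','
STORE A INTO '/my/home/output'
    USING org.apache.pig.piggybank.storage.MultiStorage('/my/home/output', '1', 'gz', ',');
```

Note that the index is passed as a quoted string, as in the original example.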

Reference: GrepCode

Feat answered 15/5, 2015 at 14:31 Comment(0)

Indeed, there is a MultiStorage class in Piggybank which does exactly what I want: it splits the records by a specified field (the one at index '0', i.e. userid, in my example):

STORE records INTO 'output' USING org.apache.pig.piggybank.storage.MultiStorage('output', '0', 'none', ',');
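
For completeness, a minimal end-to-end sketch of the question's case might look like the following. The jar path, the comma-separated input format and the second field (`item`) are assumptions added for illustration; the original schema is elided in the question.

```pig
-- Sketch only: jar path, input delimiter and the 'item' field are assumptions.
REGISTER piggybank.jar;

records = LOAD 'input' USING PigStorage(',') AS (userid:int, item:chararray);

-- No GROUP is needed here: MultiStorage splits on the field at index '0' (userid).
STORE records INTO 'output'
    USING org.apache.pig.piggybank.storage.MultiStorage('output', '0', 'none', ',');
```

As far as I recall from the MultiStorage Javadoc, each distinct userid should then get its own subdirectory under 'output', containing one part file per task that wrote records for that key (something like output/42/42-0000). Treat the exact naming as something to verify against your Pig version.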
Fertility answered 17/2, 2012 at 9:15 Comment(2)
Do you know how to do the same but, instead of specifying a compression format, store the files in RC format? - Collimate
Sorry Emtiaz, I don't know. - Fertility
