Group key value of map in pig

I am new to Pig scripting. Say we have a file:

[a#1,b#2,c#3]
[a#4,b#5,c#6]
[a#7,b#8,c#9]

Pig script:

A = LOAD 'txt' AS (in: map[]);
B = FOREACH A GENERATE in#'a';
DUMP B;

We know that we can look up a value in a map by supplying its key. In the example above, I extracted the values stored under the key "a". Assuming that I don't know the keys in advance, I want to group the values by key across the whole relation and dump the result:

(a,{1,4,7})
(b,{2,5,8})
(c,{3,6,9})    

Does Pig allow such operations, or do I need to write a UDF? Please help me through this. Thanks.

Cascade answered 18/9, 2012 at 12:21

You can create a custom UDF that converts the map to a bag of (key, value) tuples (using Pig v0.10.0):

package com.example;

import java.io.IOException;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

/**
 * Converts a Pig map into a bag of (key, value) tuples,
 * so the entries can be flattened and grouped by key.
 */
public class MapToBag extends EvalFunc<DataBag> {

    private static final BagFactory bagFactory = BagFactory.getInstance();
    private static final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        try {
            // The first (and only) argument is expected to be the map field.
            @SuppressWarnings("unchecked")
            Map<String, Object> map = (Map<String, Object>) input.get(0);
            // A null map yields a null bag.
            DataBag result = null;
            if (map != null) {
                result = bagFactory.newDefaultBag();
                // Emit one two-field tuple per map entry: (key, value).
                for (Entry<String, Object> entry : map.entrySet()) {
                    Tuple tuple = tupleFactory.newTuple(2);
                    tuple.set(0, entry.getKey());
                    tuple.set(1, entry.getValue());
                    result.add(tuple);
                }
            }
            return result;

        }
        catch (Exception e) {
            throw new RuntimeException("MapToBag error", e);
        }
    }
}

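Then, after registering the jar that contains the UDF, flatten the resulting bag into (key, value) pairs: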

B = foreach A generate 
      flatten(com.example.MapToBag(in)) as (k:chararray, v:chararray);
describe B;
B: {k: chararray,v: chararray}

Now group by key and use a nested foreach:

C = foreach (group B by k) {
    value = foreach B generate v;
    generate group as key, value;
};
dump C;
(a,{(1),(4),(7)})
(b,{(2),(5),(8)})
(c,{(3),(6),(9)})
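
To check the expected result outside of Pig, the same flatten-then-group logic can be sketched in plain Java. This is only an illustration under stated assumptions: the class `GroupByKeySketch` and its helpers `row` and `groupValuesByKey` are hypothetical names that are not part of Pig, and the three rows are the sample data from the question. It uses sorted, ordered collections for readable output, whereas Pig bags carry no ordering guarantee.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupByKeySketch {

    // Hypothetical helper: builds one input row with the keys a, b and c.
    public static Map<String, Object> row(Object a, Object b, Object c) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("a", a);
        m.put("b", b);
        m.put("c", c);
        return m;
    }

    // Hypothetical helper: walks every entry of every row (the "flatten" step)
    // and collects the values under their key (the "group by" step).
    public static Map<String, List<Object>> groupValuesByKey(List<Map<String, Object>> rows) {
        Map<String, List<Object>> grouped = new TreeMap<>();
        for (Map<String, Object> row : rows) {
            if (row == null) {
                continue; // skip missing maps, similar to the null check in MapToBag
            }
            for (Map.Entry<String, Object> entry : row.entrySet()) {
                grouped.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                       .add(entry.getValue());
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row("1", "2", "3"));
        rows.add(row("4", "5", "6"));
        rows.add(row("7", "8", "9"));
        // prints {a=[1, 4, 7], b=[2, 5, 8], c=[3, 6, 9]}
        System.out.println(groupValuesByKey(rows));
    }
}
```

The keys are never named in the grouping code, which is the point of converting the map to (key, value) pairs first: the grouping works for whatever keys happen to appear in the data.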
Houseboat answered 22/9, 2012 at 9:55 Comment(6)
I don't think the final nested foreach will work, as FOREACH isn't allowed inside FOREACH as of v0.9. - Aenneea
This sample was done in v0.10.0, in which it is allowed. Anyway, I updated my answer to indicate the Pig version I used. - Houseboat
I guess I'm going to have to upgrade to v0.10, since I would really like to select columns in a nested foreach. :) - Aenneea
@LorandBendig, I am also new to Pig and have the same kind of scenario. I tried this, but on execution I get the error 2015-01-03 11:21:24,655 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve mupigudf.MyPigUDF using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.] Can you please suggest what is going wrong? I have googled it as well, but I do not understand why it occurs. - Banderilla
@Banderilla Didn't you forget to register the jar which contains MyPigUDF? - Houseboat
@LorandBendig I have registered MyPigUDF.jar using this line: REGISTER /home/megabytes/hadoop-work/PIG/small-data/MyFirstUDF.jar - Banderilla
