Using pig, how do I parse a mixed format line into tuples and a bag of tuples?

I'm new to pig, and I'm having an issue parsing my input and getting it into a format that I can use. The input file contains lines that have both fixed fields and KV pairs as follows:

FF1|FF2|FF3|FF4|KVP1|KVP2|...|KVPn

My goal here is to count the number of unique fixed field combinations for each of the KV Pairs. So considering the following input lines:

1|2|3|4|key1=value1|key2=value2
2|3|4|5|key1=value7|key2=value2|key3=value3

When I'm done, I'd like to be able to generate the following results (the output format doesn't really matter at this point, I'm just showing you what I'd like the results to be):

key1=value1 : 1
key1=value7 : 1
key2=value2 : 2
key3=value3 : 1
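
To pin down the counting being asked for, here is a minimal plain-Java sketch of the intended semantics. It is not Pig code, and the class name `KvPairCounts` and method `count` are made up for illustration. It assumes every line has exactly four fixed fields followed by zero or more `|`-separated KV pairs, and it counts the distinct fixed-field combinations seen with each pair.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of Pig: models the counting described in the question.
class KvPairCounts {
    // For each KV pair, count the distinct fixed-field combinations it appears with.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Set<String>> combos = new TreeMap<>();
        for (String line : lines) {
            // Limit 5: four fixed fields plus one remainder holding all KV pairs.
            String[] parts = line.split("\\|", 5);
            if (parts.length < 5) {
                continue; // assumption: lines without KV pairs are skipped
            }
            String fixed = String.join("|", parts[0], parts[1], parts[2], parts[3]);
            for (String kv : parts[4].split("\\|")) {
                if (kv.isEmpty()) {
                    continue;
                }
                combos.computeIfAbsent(kv, k -> new HashSet<>()).add(fixed);
            }
        }
        // Reduce each set of combinations to its size.
        Map<String, Integer> counts = new TreeMap<>();
        combos.forEach((kv, set) -> counts.put(kv, set.size()));
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "1|2|3|4|key1=value1|key2=value2",
            "2|3|4|5|key1=value7|key2=value2|key3=value3");
        count(lines).forEach((kv, n) -> System.out.println(kv + " : " + n));
    }
}
```

Run on the two sample lines, this yields the four counts listed above. A Pig script would need to reproduce the same split, flatten, group, and count steps.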

It seems like I should be able to do this by grouping the fixed fields and flattening a bag of the KV pairs to generate the cross product.

I've tried reading this in with something like:

data = load 'myfile' using PigStorage('|');
A = foreach data generate $0 as ff1:chararray, $1 as ff2:long, $2 as ff3:chararray, $3 as ff4:chararray, TOBAG($4..) as kvpairs:bag{kvpair:tuple()};
B = foreach A { sorted = order A by ff2; lim = limit sorted 1; generate group.ff1, group.ff4, flatten( lim.kvpairs ); };
C = filter B by ff3 matches 'somevalue';
D = foreach C generate ff1, ff4, flatten( kvpairs ) as kvpair;
E = group D by (ff1, ff4, kvpair);
F = foreach E generate group, COUNT(D);

This generates records with a schema as follows:

A: {date: long,hms: long,id: long,ff1: chararray,ff2: long,ff3: chararray,ff4: chararray,kvpairs: {kvpair: (NULL)}}

While this gets me the schema that I want, there are several problems that I can't seem to solve:

  1. By using the TOBAG with .., no schema can be applied to my kvpairs, so I can't ever filter on kvpair, and I don't seem to be able to cast this at any point, so it's an all or nothing query.
  2. The filter in statement 'C' seems to return no data regardless of what value I use, even if I use something like '.*' or '.+'. I don't know if this is because there is no schema, or if this is actually a bug in pig. If I dump some data from statement B, I definitely see data there that would match those expressions.

So I've tried approaching the problem differently, by loading the data using:

data = load 'myfile' using PigStorage('\n') as (line:chararray);
init_parse = foreach data generate FLATTEN( STRSPLIT( line, '\\|', 5 ) ) as (ff1:chararray, ff2:chararray, ff3:chararray, ff4:chararray, kvpairsStr:chararray);
A = foreach init_parse generate ff1, ff2, ff3, ff4, TOBAG( STRSPLIT( kvpairsStr, '\\|', 500 ) ) as kvpairs:bag{t:(kvpair:chararray)};

The issue here is that the TOBAG( STRSPLIT( ... ) ) results in a bag of a single tuple, with each of the kvpairs being a field in that tuple. I really need the bag to contain, each of the individual kvpairs as a tuple of one field so that when I flatten the bag, I get the cross product of the bag and the group that I'm interested in.
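
The difference between the two bag shapes can be sketched in plain Java. This is only a model, not the Pig API: a "tuple" is a list of fields, a "bag" is a list of tuples, and the class `BagShapes` and its methods are hypothetical names. It assumes that flattening a bag next to other fields emits one row per tuple in the bag, which is the behaviour described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model, not Pig: a tuple is a List<String>, a bag is a list of tuples.
class BagShapes {
    // Models flattening a bag next to fixed fields: one output row per tuple in the bag.
    static List<List<String>> flatten(List<String> fixed, List<List<String>> bag) {
        List<List<String>> rows = new ArrayList<>();
        for (List<String> tuple : bag) {
            List<String> row = new ArrayList<>(fixed);
            row.addAll(tuple);
            rows.add(row);
        }
        return rows;
    }

    // The shape reported for TOBAG(STRSPLIT(...)): a single tuple holding every pair.
    static List<List<String>> singleTupleBag(String kvpairs) {
        List<List<String>> bag = new ArrayList<>();
        bag.add(Arrays.asList(kvpairs.split("\\|")));
        return bag;
    }

    // The desired shape: one single-field tuple per pair.
    static List<List<String>> oneTuplePerPair(String kvpairs) {
        List<List<String>> bag = new ArrayList<>();
        for (String kv : kvpairs.split("\\|")) {
            bag.add(Arrays.asList(kv));
        }
        return bag;
    }

    public static void main(String[] args) {
        List<String> fixed = Arrays.asList("1", "4");
        String kvpairs = "key1=value1|key2=value2";
        // One wide row: both pairs end up in the same record.
        System.out.println(flatten(fixed, singleTupleBag(kvpairs)));
        // Two rows: each pair is paired with the fixed fields separately.
        System.out.println(flatten(fixed, oneTuplePerPair(kvpairs)));
    }
}
```

Only the second shape gives one record per KV pair after flattening, which is what the later group-and-count step needs.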

I'm open to other ways of attacking this problem as well, but I can't seem to find a good way to transform my tuple of multiple fields into a bag of tuples, with each tuple having one field.

I'm using Apache Pig version 0.11.1.1.3.0.0-107

Thanks in advance.

Unlucky answered 27/9, 2013 at 18:0 Comment(0)

Your second approach is on the right track. Unfortunately, you'll need a UDF to convert a tuple to a bag, and as far as I know there is no builtin to do this. It's a simple matter to write one, however.

You won't want to group on the fixed fields, but rather on the key-value pairs themselves. So you only need to keep the tuple of key-value pairs; you can completely ignore the fixed fields.

The UDF is pretty simple. In Java, you can just do something like this in your exec method:

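// Output bag: one single-field tuple per field of the input tuple.
DataBag b = new DefaultDataBag();
// The first argument to the UDF is the tuple returned by STRSPLIT.
Tuple t = (Tuple) input.get(0);
for (int i = 0; i < t.size(); i++) {
    Object o = t.get(i);
    // newTuple(Object) builds a tuple with exactly one field.
    Tuple e = TupleFactory.getInstance().newTuple(o);
    b.add(e);
}

return b;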

Once you have that, turn the tuple from STRSPLIT into a bag, flatten it, and then do the grouping and counting.

Auspicate answered 27/9, 2013 at 19:13 Comment(1)
Thanks very much for the reply. I did manage to find the TOKENIZE() function, which I can use instead of STRSPLIT and which returns exactly what I want (a bag of tuples with one field each). Your answer is definitely more general purpose than the situational TOKENIZE, so I will accept this answer. Thanks again! - Unlucky
