I'm new to pig, and I'm having an issue parsing my input and getting it into a format that I can use. The input file contains lines that have both fixed fields and KV pairs as follows:
FF1|FF2|FF3|FF4|KVP1|KVP2|...|KVPn
My goal here is to count the number of unique fixed field combinations for each of the KV Pairs. So considering the following input lines:
1|2|3|4|key1=value1|key2=value2
2|3|4|5|key1=value7|key2=value2|key3=value3
When I'm done, I'd like to be able to generate the following results (the output format doesn't really matter at this point, I'm just showing you what I'd like the results to be):
key1=value1 : 1
key1=value7 : 1
key2=value2 : 2
key3=value3 : 1
It seems like I should be able to do this by grouping the fixed fields and flattening a bag of the KV Pairs to generate the cross product
I've tried reading this in with something like:
data = load 'myfile' using PigStorage('|');
A = foreach data generate $0 as ff1:chararray, $1 as ff2:long, $2 as ff3:chararray, $3 as ff4:chararray, TOBAG($4..) as kvpairs:bag{kvpair:tuple()};
B = foreach A { sorted = order A by ff2; lim = limit sorted 1; generate group.ff1, group.ff4, flatten( lim.kvpairs ); };
C = filter B by ff3 matches 'somevalue';
D = foreach C generate ff1, ff4, flatten( kvpairs ) as kvpair;
E = group D by (ff1, ff4, kvpair);
F = foreach E generate group, COUNT(E);
This generates records with a schema as follows:
A: {date: long,hms: long,id: long,ff1: chararray,ff2: long,ff3: chararray,ff4: chararray,kvpairs: {kvpair: (NULL)}}
While this gets me the schema that I want, there are several problems that I can't seem to solve:
- By using the TOBAG with .., no schema can be applied to my kvpairs, so I can't ever filter on kvpair, and I don't seem to be able to cast this at any point, so it's an all or nothing query.
- The filter in statement 'C' seems to return no data regardless of what value I use, even if I use something like '.*' or '.+'. I don't know if this is because there is no schema, or if this is actually a bug in pig. If I dump some data from statement B, I definitely see data there that would match those expressions.
So I've tried approaching the problem differently, by loading the data using:
data = load 'myfile' using PigStorage('\n') as (line:chararray);
init_parse = foreach data generate FLATTEN( STRSPLIT( line, '\\|', 4 ) ) as (ff1:chararray, ff2:chararray, ff3:chararray, ff4:chararray, kvpairsStr:chararray);
A = foreach mc_bk_data generate ff1, ff2, ff3, ff4, TOBAG( STRSPLIT( kvpairsStr, '\\|', 500 ) ) as kvpairs:bag{t:(kvpair:chararray)};
The issue here is that the TOBAG( STRSPLIT( ... ) ) results in a bag of a single tuple, with each of the kvpairs being a field in that tuple. I really need the bag to contain, each of the individual kvpairs as a tuple of one field so that when I flatten the bag, I get the cross product of the bag and the group that I'm interested in.
I'm open to other ways of attacking this problem as well, but I can seem to find good way to transform my tuple of multiple fields into a bag of tuples, with each tuple having one field each.
I'm using Apache Pig version 0.11.1.1.3.0.0-107
Thanks in advance.