avoiding prefixes in multi relation join in pig
Asked Answered
D

2

1

I am trying to do a star schema type of join in pig and below is my code. When I join multiple relations with different columns, I have to prefix the name of the previous join every time to get it working. I am sure there should be some better way, I am not able to find it through googling. Any pointers will be very helpful.

i.e prefixing a column like this "H864::H86::hs_8_d::hs_8_desc" is what I want to avoid.

hs_8 = LOAD 'hs_8_distinct' USING PigStorage('^') as (hs_8:chararray,hs_8_desc:chararray);
hs_8_d = FOREACH hs_8 GENERATE SUBSTRING(hs_8,0,2) as hs_2,SUBSTRING(hs_8,0,4) as hs_4,SUBSTRING(hs_8,0,6) as hs_6,hs_8,hs_8_desc;

hs_6_d = LOAD 'hs_6_distinct' USING PigStorage('^') as (hs_6:chararray,hs_6_desc:chararray);
hs_4_d = LOAD 'hs_4_distinct' USING PigStorage('^') as (hs_4:chararray,hs_4_desc:chararray);
hs_2_d = LOAD 'hs_2_distinct' USING PigStorage('^') as (hs_2:chararray,hs_2_desc:chararray);

H86 = JOIN hs_8_d BY hs_6, hs_6_d BY hs_6 USING 'replicated' ;
H864 = JOIN H86 BY hs_8_d::hs_4, hs_4_d BY hs_4 USING 'replicated' ;
H8642 = JOIN H864 BY H86::hs_8_d::hs_2, hs_2_d BY hs_2 USING 'replicated' ;

hs_dim = FOREACH H8642 GENERATE hs_2_d::hs_2,hs_2_d::hs_2_desc,H864::hs_4_d::hs_4,H864::hs_4_d::hs_4_desc,H864::H86::hs_6_d::hs_6,H864::H86::hs_6_d::hs_6_desc,H864::H86::hs_8_d::hs_8,H864::H86::hs_8_d::hs_8_desc;
Deictic answered 22/6, 2013 at 22:19 Comment(0)
L
2

By adding extra foreach to the joins you can slightly simplify the aliases. Check the statistics, this won't add extra MR jobs to the pipeline. The original and this will yield to 4 map-only jobs.

E.g:

H86 = foreach (JOIN hs_8_d BY hs_6, hs_6_d BY hs_6 USING 'replicated') generate 
        hs_8_d::hs_2 as x1, 
        hs_8_d::hs_4 as x2, 
        hs_8_d::hs_6 as x3, 
        hs_8_d::hs_8 as x4,
        hs_8_d::hs_8_desc as x5, 
        hs_6_d::hs_6 as x6,
        hs_6_d::hs_6_desc as x7;

H864 = foreach (JOIN H86 BY x2, hs_4_d BY hs_4 USING 'replicated') generate 
          H86::x1 as y1,
          H86::x2 as y2, 
          H86::x3 as y3,
          H86::x4 as y4,
          H86::x5 as y5, 
          H86::x6 as y6, 
          H86::x7 as y7,
          hs_4_d::hs_4 as y8,
          hs_4_d::hs_4_desc as y9;

H8642 = foreach (JOIN H864 BY y1, hs_2_d BY hs_2 USING 'replicated') generate 
          H864::y1 as z1, 
          H864::y2 as z2,
          H864::y3 as z3, 
          H864::y4 as z4, 
          H864::y5 as z5, 
          H864::y6 as z6, 
          H864::y7 as z7,
          H864::y8 as z8, 
          H864::y9 as z9, 
          hs_2_d::hs_2 as z10, 
          hs_2_d::hs_2_desc as z11;

hs_dim = FOREACH H8642 GENERATE z10, z11, z8, z9, z6, z7, z4, z5;

If you have a bag of tuples, then Datafu's AliasBagFields may be helpful.

Leftward answered 24/6, 2013 at 10:15 Comment(0)
P
0

Pig will always prefixes fields with bagname:: to disambiguate fields after joins. I don't think you can avoid this unfortunately.

Primalia answered 22/6, 2013 at 23:2 Comment(1)
it gets complex when I have more than 3 relations and I find it difficult to derive the lengthy prefixes, how do you handle this case? or is there a easy way to derive the prefixes? I think it can be super complex if i am joining 20+ relations. I am pig newbie.. so curious to know how to handle this in pigDeictic

© 2022 - 2024 — McMap. All rights reserved.