Generating all fields from an alias after a JOIN in Pig
Asked Answered
N

4

8

I would like to perform the equivalent of "keep all a in A where a.field == b.field for some b in B" in Apache Pig. I am implementing it like so,

AB_joined = JOIN A by field, B by field;
A2 = FOREACH AB_joined GENERATE A::field as field, A::field2 as field2, A::field3 as field3;

Enumerating all of A's entries is quite silly, and I would rather do something like,

A2 = FOREACH AB_joined GENERATE flatten(A);

However, this doesn't seem to work. Is there some other way I can do something equivalent without enumerating A's fields?

Nari answered 30/5, 2012 at 23:23 Comment(0)
F
5

This should work:

A2 = FOREACH AB_joined GENERATE $0..
Fatso answered 29/8, 2012 at 14:40 Comment(0)
E
3

You can use COGROUP to keep the columns of A separate from columns of B. This is especially useful when A's schema is dynamic and you don't want your code to fail when A's schema changes.

AB = COGROUP A BY field, B BY field;

-- schema of AB will be:
-- {group, A:{all fields of A}, B:{all fields of B}}

A2 = FOREACH AB FLATTEN(A);

Hope this helps.

Eyewash answered 15/9, 2014 at 11:48 Comment(3)
This is nice! However it generates field names with the A:: prefix. Do you know if there is a way to get rid of that?Acinus
@HariShankar you can always project the elements using AS.... e.g. A2 = FOREACH AB FLATTEN(A) AS (a1:string, a2:long);Eyewash
Yes I can do that but that involves rewriting all the fields which means the script needs to be changed whenever a new field is added :(Acinus
B
2

Starting from at least pig 0.9.1 you can use either Star Expressions or Project-Range Expressions to select multiple fields from tuple. Read Pig Latin 0.15.0, Expressions chapter for details.

Here is my example which I made just to give you understanding.

-- A: {id: long, f1: int, f2: int, f3: int, f4: int}
-- B: {id: long, f5: int}

Let's join A & B and select only A's fields

AB = FOREACH (JOIN A BY id, B by id) GENERATE $0..$4;
--AB: {A::id: long, A::f1: int, A::f2: int, A::f3: int, A::f4: int}

or

BA = FOREACH (JOIN B BY id, A by id) GENERATE $2..;
--BA: {A::id: long, A::f1: int, A::f2: int, A::f3: int, A::f4: int}

selecting all fields using Star expression

AB = FOREACH (JOIN A BY id, B by id) GENERATE *;
--AB: {A::id: long, A::f1: int, A::f2: int, A::f3: int, A::f4: int, B::id: long, B::f5: int}

selecting all distinct fields (without B::id field) using Project-range expression

AB = FOREACH (JOIN A BY id, B by id) GENERATE $0..$4, f5;
--AB: {A::id: long, A::f1: int, A::f2: int, A::f3: int, A::f4: int, B::f5: int}

Sometimes it's really useful when you have tens of fields in one relation and only couple in another.

Basia answered 21/7, 2015 at 11:15 Comment(5)
is there a way to dynamically get the end column count? I am trying to create a generic script that I can pass multiple schemas from shell script. so depending on the schema passed the columns will be different. how can I solve this?Fecundity
Not sure I understood you right. Do you want to get a number of columns in a relation or want to count all rows having not null value in the last column?Basia
I am passing schema at runtime to pig script based on the parameter that is passed during invoking of my shell script. so, it may be that during one run, that file passed has only 10 columns and in another run file that is passed has 30 columns.. how can I count the number of columns for a given schema? so I can put something like this AB = FOREACH (JOIN A BY id, B by id) GENERATE $0..$10, f5;Fecundity
I am passing schema at runtime to pig script based on the parameter that is passed during invoking of my shell script. so, it may be that during one run, that file passed has only 10 columns and in another run file that is passed has 30 columns.. how can I count the number of columns for a given schema? so I can put something like this AB = FOREACH (JOIN A BY id, B by id) GENERATE $0..$10, f5; AB = FOREACH (JOIN A BY id, B by id) GENERATE $0..$30, f5;Fecundity
I don't know the easy way to do this. Because you can't pass pre-calculated parameter into foreach range expression like BA = FOREACH (JOIN B BY id, A by id) GENERATE $2..$aLength, f5;, but you can write your own UDF for filtering join result inside it.Basia
B
2

With Pig 12 and above, use PluckTuple: https://pig.apache.org/docs/r0.12.0/func.html#plucktuple.

AB_joined = JOIN A by field, B by field;
DEFINE pluck PluckTuple('A::');
A2 = FOREACH AB_joined generate FLATTEN(pluck(*));
Bloomy answered 9/3, 2017 at 17:21 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.