Reference columns in a FOREACH after a JOIN?
Asked Answered
I

1

21
A = load 'a.txt' as (id, a1);
B = load 'b.txt as (id, b1);
C = join A by id, B by id;
D = foreach C generate id,a1,b1;
dump D;

4th line fails on: Invalid field projection. Projected field [id] does not exist in schema

I tried to change to A.id but then the last line fails on: ERROR 0: Scalar has more than one row in the output.

Implicative answered 8/11, 2011 at 13:32 Comment(0)
B
49

What you are looking for is the "Disambiguate Operator". What you want is A::id, not A.id.

A.id says "there is a relation/bag A and there is a column called id in its schema"

A::id says "there is a record from A and that has a column called id"

So, you would do:

A = load 'a.txt' as (id, a1);
B = load 'b.txt as (id, b1);
C = join A by id, B by id;
D = foreach C generate A::id,a1,b1;
dump D;

A dirty alternative:

Just because I'm lazy, and disambiguation gets really weird when you start doing multiple joins one after another: use unique identifiers.

A = load 'a.txt' as (ida, a1);
B = load 'b.txt as (idb, b1);
C = join A by ida, B by idb;
D = foreach C generate ida,a1,b1;
dump D;
Bracteate answered 9/11, 2011 at 16:5 Comment(3)
Thanks for the explanation. Adding a link to this related question.Acrobatic
If I have any doubts what is the name of a field, I just use DESCRIBE. In your case DESCRIBE C; gives us C: {A::id: bytearray,A::a1: bytearray,B::id: bytearray,B::b1: bytearray}. Now you can easily find proper field name.Vulcanize
Is there any way to access the names of the fields taken from 'A' within the join result dynamically via some sort of introspection? For example, what if I don't know the schema of 'A' until runtime and then I want the resulting projection on line 4 to only include the columns from 'A' and not 'B'? Thanks.Britannia

© 2022 - 2024 — McMap. All rights reserved.