Schema of the FLATTEN operator in Pig Latin

I recently ran into this problem at work; it concerns Pig's FLATTEN. Here is a simple example that reproduces it.

Two files:
===file1===
1_a
2_b
4_d

===file2 (tab separated)===
1 a
2 b
3 c

pig script 1:

a = load 'file1' as (str:chararray);
b = load 'file2' as (num:int, ch:chararray);
a1 = foreach a generate flatten(STRSPLIT(str,'_',2)) as (num:int, ch:chararray);
c = join a1 by num, b by num;
dump c;   -- exception java.lang.String cannot be cast to java.lang.Integer

pig script 2:

a = load 'file1' as (str:chararray);
b = load 'file2' as (num:int, ch:chararray);
a1 = foreach a generate flatten(STRSPLIT(str,'_',2)) as (num:int, ch:chararray);
a2 = foreach a1 generate (int)num as num, ch as ch;
c = join a2 by num, b by num;
dump c;   -- exception java.lang.String cannot be cast to java.lang.Integer

pig script 3:

a = load 'file1' as (str:chararray);
b = load 'file2' as (num:int, ch:chararray);
a1 = foreach a generate flatten(STRSPLIT(str,'_',2));
a2 = foreach a1 generate (int)$0 as num, $1 as ch;
c = join a2 by num, b by num;
dump c;   -- right

I don't know why scripts 1 and 2 fail while script 3 works. I would also like to know whether there is a more concise way to get relation c. Thanks.

Sharyl answered 31/8, 2012 at 10:38 Comment(0)

Is there any particular reason you are not using PigStorage? It could make life much easier for you :)

a = load '/file1' USING PigStorage('_') AS (num:int, char:chararray);
b = load '/file2' USING PigStorage('\t') AS (num:int, char:chararray);
c = join a by num, b by num;
dump c;

Also note that in file1 you used an underscore as the delimiter, but you give "-" as the argument to STRSPLIT.

edit: I have spent some more time on the scripts you provided. Scripts 1 and 2 indeed do not work, and script 3 also works like this (without the extra foreach):

a = load 'file1' as (str:chararray);
b = load 'file2' as (num:int, ch:chararray);
a1 = foreach a generate flatten(STRSPLIT(str,'_',2));
c = join a1 by (int)($0), b by num;
dump c;

As for the source of the problem, I'll take a wild guess and say it might be related to the following (as stated in the Pig documentation), combined with Pig's run cycle optimizations:

If you FLATTEN a bag with empty inner schema, the schema for the resulting relation is null.

In your case, I believe the schema of the STRSPLIT result is unknown until runtime.

edit2: Ok, here is my theory explained:

Below are the interesting parts of the -explain- output for script 2 and for script 3.

|---a2: (Name: LOForEach Schema: num#288:int,ch#289:chararray)
|   |   |
|   |   (Name: LOGenerate[false,false] Schema: num#288:int,ch#289:chararray)ColumnPrune:InputUids=[288, 289]ColumnPrune:OutputUids=[288, 289]
|   |   |   |
|   |   |   (Name: Cast Type: int Uid: 288)
|   |   |   |
|   |   |   |---num:(Name: Project Type: int Uid: 288 Input: 0 Column: (*))

The section above is for script 2; see the last line. Pig assumes the output of flatten(STRSPLIT) has a first element of type int, because you declared the schema that way. But STRSPLIT actually has a null output schema, whose fields are treated as bytearray, so the output of flatten(STRSPLIT) is really (n:bytearray, c:bytearray). Because you provided a schema, Pig tries to apply a Java cast to the num field of a1's output, which fails because num is in fact a Java String represented as a bytearray. Since this Java cast fails, Pig never gets to the explicit cast shown on the line above it.
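
To make the difference between the two kinds of cast concrete, here is a small plain-Java sketch. It is only an illustration, not Pig's actual internals: the class CastDemo and the methods javaCast and convert are made-up names. It shows why treating a String as if it were already an Integer throws the exception from the question, while a conversion that parses the text succeeds.

```java
public class CastDemo {
    // Plain Java cast: succeeds only if the object really is an Integer.
    // For a String it throws ClassCastException, the error seen in scripts 1 and 2.
    public static Integer javaCast(Object field) {
        return (Integer) field;
    }

    // Conversion: parses the text into a number. This is only an analogy
    // for an explicit (int) cast applied to a field Pig knows is a bytearray.
    public static Integer convert(Object field) {
        return Integer.parseInt(field.toString().trim());
    }

    public static void main(String[] args) {
        Object field = "1"; // hypothetical value: the String that STRSPLIT produced

        try {
            javaCast(field);
        } catch (ClassCastException e) {
            System.out.println("java cast failed: " + e.getMessage());
        }

        System.out.println("converted: " + convert(field));
    }
}
```

The declared schema only changes what Pig believes about the field; it does not change the object stored in the tuple, which is why the first path fails.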

Let's see the situation for script 3:

|---a2: (Name: LOForEach Schema: num#85:int,ch#87:bytearray)
|   |   |
|   |   (Name: LOGenerate[false,false] Schema: num#85:int,ch#87:bytearray)ColumnPrune:InputUids=[]ColumnPrune:OutputUids=[85, 87]
|   |   |   |
|   |   |   (Name: Cast Type: int Uid: 85)
|   |   |   |
|   |   |   |---(Name: Project Type: bytearray Uid: 85 Input: 0 Column: (*))

See the last line: here the output of a1 is correctly treated as bytearray, so there is no problem. Now look at the second-to-last line: Pig tries (and succeeds) to perform an explicit cast from bytearray to int.

Fini answered 3/9, 2012 at 6:11 Comment(6)
Yes, I used PigStorage at the beginning, but the '_'-delimited data is generated during processing, so I have no choice but to use STRSPLIT. - Sharyl
And the '-' was carelessness on my part; in the actual code it is correct. - Sharyl
Thanks for your answer, but I still don't understand. When you describe a1, Pig gives a1: {num: int,ch: chararray}; why can't this schema be used in the join operator? And what is the difference between script 2 and script 3, given that one just has named fields and the other does not? - Sharyl
So you mean the cast exception happens in the AS clause, but dump a1 didn't cause an exception. - Sharyl
The cast exception is caused by the AS clause, though it (the AS clause) is not evaluated until you call dump or store. - Fini
But I called dump a1 and dump a2, and both were OK. The exception only happened after the join. - Sharyl
