Self cross-join in pig is disregarded
N

2

9

Suppose one has data like this:

A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)

And then a cross-join is done on A, A:

B = CROSS A, A;

DUMP B;
(1,2,3)
(4,2,1)

Why is the second A optimized out of the query?

Info: Pig version 0.11

== UPDATE ==

If I sort A first:

C = ORDER A BY a1;
D = CROSS A, C;

then the result is a correct cross join.

Nitaniter answered 6/3, 2013 at 19:48 Comment(0)
R
10

I think you have to load the data twice to achieve what you want.

i.e.

A1 = LOAD 'data' AS (a1:int,a2:int,a3:int);
A2 = LOAD 'data' AS (a1:int,a2:int,a3:int);
B = CROSS A1, A2;
Rivera answered 6/3, 2013 at 19:58 Comment(1)
It's because of the type of map-reduce jobs that are spawned in the background: however you do the join, you'll need two separate inputs. – Rivera
S
14

davek is correct -- you cannot CROSS (or JOIN) a relation with itself. If you wish to do this, you must create a copy of the data. In this case, you can use another LOAD statement. If you want to do this with a relation further down a pipeline, you'll need to duplicate it using FOREACH.

I have several macros that I use frequently and IMPORT by default in all of my Pig scripts in case I need them. One is used for just this purpose:

DEFINE DUPLICATE(in) RETURNS out
{
        $out = FOREACH $in GENERATE *;
};

This will work for you wherever in your pipeline you need a duplicate:

A1 = LOAD 'data' AS (a1:int,a2:int,a3:int);
A2 = DUPLICATE(A1);
B = CROSS A1, A2;

Note that even though A1 and A2 are identical, you cannot assume that the records are in the same order. But if you are doing a CROSS or JOIN, this probably doesn't matter.
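
If you would rather not define a macro, the same duplication can be written inline. The following is a minimal sketch, assuming the same 'data' file and schema as in the question; the alias `pairs` and the output names `left_a1` and `right_a1` are made up for illustration. It also shows how the two copies of a column can be told apart after the CROSS, using Pig's `relation::field` prefix.

```pig
-- Sketch only: duplicate the relation inline instead of via the DUPLICATE macro.
A1 = LOAD 'data' AS (a1:int, a2:int, a3:int);
A2 = FOREACH A1 GENERATE *;   -- same projection the macro performs
B  = CROSS A1, A2;

-- Both inputs have a column named a1, so after the CROSS each one is
-- referenced with its relation prefix. The AS names are illustrative.
pairs = FOREACH B GENERATE A1::a1 AS left_a1, A2::a1 AS right_a1;
DUMP pairs;
```

With the two-row input from the question, B should contain four tuples of six fields each, in no guaranteed order.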

Siberson answered 6/3, 2013 at 21:49 Comment(0)
