How to remove duplicate columns after a JOIN in Pig?
Asked Answered
H

1

10

Let's say I JOIN two relations like:

-- part looks like:
-- 1,5.3
-- 2,4.9
-- 3,4.9

-- original looks like:
-- 1,Anju,3.6,IT,A,1.6,0.3
-- 2,Remya,3.3,EEE,B,1.6,0.3
-- 3,akhila,3.3,IT,C,1.3,0.3

jnd = JOIN part BY $0, original BY $0;

The output will be:

1,5.3,1,Anju,3.6,IT,A,1.6,0.3
2,4.9,2,Remya,3.3,EEE,B,1.6,0.3
3,4.9,3,akhila,3.3,IT,C,1.3,0.3

Notice that $0 is shown twice in each tuple. EG:

1,5.3,1,Anju,3.6,IT,A,1.6,0.3
^     ^
|-----|

I can remove the duplicate key manually by doing:

jnd = foreach jnd generate $0,$1,$3,$4 ..;

Is there a way to remove this dynamically? Like remove(the duplicate key joiner).

Hypoglycemia answered 20/4, 2014 at 5:13 Comment(2)
As far as I know there is no way to do this currently with Pig. Additionally, the only way other than your current solution I can think of is FOREACH foo GENERATE FLATTEN(remove(TOTUPLE(*), 2)); where 2 is the position of the column to remove and remove is a UDF. I think your solution much better, however you can shorten it to FOREACH jnd GENERATE $0, $1, $3 ..;.Overwrought
I wish there was a way to do this easier, but here is a nice syntax for defining the output of the join manually: https://mcmap.net/q/1168233/-avoiding-prefixes-in-multi-relation-join-in-pigDeadpan
M
4

Have faced the same kind of issue while working on Data Set Joining and other data processing techniques where in output the column names get repeated.

So was working on UDF which will remove the duplicates column by using schema name of that field and retaining the first unique column occurrence data.

Pre-Requisite:

Name of all the fields should be present

You need to download this UDF file and make it jar so as to use it.

UDF file location from GitHub : GitHub UDF Java File Location

We will take the above question as example.

--Data Set A contains this data
-- 1,5.3
-- 2,4.9
-- 3,4.9

--Data Set B contains this data
-- 1,Anju,3.6,IT,A,1.6,0.3
-- 2,Remya,3.3,EEE,B,1.6,0.3
-- 3,Akhila,3.3,IT,C,1.3,0.3

PIG Script:

REGISTER /home/user/
DSA = LOAD '/home/user/DSALOC' AS (ROLLNO:int,CGPA:float);
DSB = LOAD '/home/user/DSBLOC' AS (ROLLNO:int,NAME:chararray,SUB1:float,BRANCH:chararray,GRADE:chararray,SUB2:float);
JOINOP = JOIN DSA BY ROLLNO,DSB BY ROLLNO;

We will get column name after joining as DSA::ROLLNO:int,DSA::CGPA:float,DSB::ROLLNO:int,DSB::NAME:chararray,DSB::SUB1:float,DSB::BRANCH:chararray,DSB::GRADE:chararray,DSB::SUB2:float

For making it to DSA::ROLLNO:int,DSA::CGPA:float,DSB::NAME:chararray,DSB::SUB1:float,DSB::BRANCH:chararray,DSB::GRADE:chararray,DSB::SUB2:float

DSB::ROLLNO:int is removed.

We need to use the UDF as

JOINOP_NODUPLICATES = FOREACH JOINOP GENERATE FLATTEN(org.imagine.REMOVEDUPLICATECOLUMNS(*));

Where org.imagine.REMOVEDUPLICATECOLUMNS is the UDF.

This UDF removes duplicate columns by using Name in schema.So DSA::ROLLNO:int is retained and DSB::ROLLNO:int is removed from the dataset.

Mcmurray answered 15/8, 2014 at 8:19 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.