Generating an id/counter for foreach in Pig Latin

2

6

I want some sort of unique identifier, line number, or counter to be generated and appended in my foreach construct as it iterates through the records. Is there a way to accomplish this without writing a UDF?

B = foreach A generate a_unique_id, field1,...etc

How do I get that 'a_unique_id' implemented?

Thanks!

Pinhole answered 3/10, 2011 at 15:44 Comment(0)
4

If you are using Pig 0.11 or later, then the RANK operator is exactly what you are looking for. For example:

DUMP A;
(foo,19)
(foo,19)
(foo,7)
(bar,90)
(etc.,0)

B = RANK A ;

DUMP B ;
(1,foo,19)
(2,foo,19)
(3,foo,7)
(4,bar,90)
(5,etc.,0)
Nagpur answered 8/10, 2013 at 9:36 Comment(1)
RANK needs ORDER BY to be deterministic. - Graybeard
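
The point in the comment above can be sketched as follows. This is a minimal example, not a definitive recipe: the input path 'input.txt', the comma delimiter, and the field names name and value are assumptions for illustration, since the original post gives no schema. It uses the BY clause of RANK (Pig 0.11 or later) so that the numbering depends on an explicit sort key rather than on input order.

```pig
-- Hypothetical input path, delimiter and schema: none of these appear in the original post.
A = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, value:int);

-- Rank by an explicit sort key so the result does not depend on input order.
-- Rows that tie on (name, value) receive the same rank.
B = RANK A BY name ASC, value DESC;

DUMP B;
```

Because tied rows share a rank, this gives a unique id only when the sort key itself is unique. With fully duplicated rows such as the two (foo,19) records, RANK without BY still numbers every row separately, at the cost of depending on input order.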
1

There is no built-in UUID function in the main Pig distribution or piggybank. Unfortunately, I think your only option is going to be writing a UDF.

There is a standard way of building UUIDs, and there is existing Java code you can use as a starting point for your UDF.

Is there a particular reason why you don't want to write a UDF?

Articulate answered 3/10, 2011 at 17:24 Comment(2)
Thanks! This might sound silly, but I am totally new to Pig. I read the tutorial for a few hours and thought there might be something built in that could serve my purpose. I suspected writing a UDF would take me some time. Anyway, I got by using the Unix utility cat -n to add a line number to each record. - Pinhole
The line-number approach doesn't seem to help in all my cases. Here is the original problem: I wanted to JOIN two data sets with exactly the same number of records but with no matching keys. What I want is a line-by-line (record-by-record) join, but my data has no unique key for the join condition. A natural or cross join is to be avoided. How do I go about "merging" the records of the two data sets? - Pinhole
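
One possible approach to the record-by-record merge asked about in the last comment is to number both relations with RANK and join on that number. This is a sketch under stated assumptions: the file names, schemas, and aliases below are hypothetical, and it assumes Pig 0.11 or later. Whether the numbers assigned by RANK without BY follow the original line order of each file is also an assumption that should be checked on real data.

```pig
-- Hypothetical inputs and schemas, for illustration only.
A = LOAD 'left.txt'  AS (a1:chararray, a2:int);
B = LOAD 'right.txt' AS (b1:chararray, b2:int);

-- RANK without BY prepends a sequential number to each tuple.
A_num = RANK A;
B_num = RANK B;

-- Join on the prepended number, which is position $0 in each relation.
J = JOIN A_num BY $0, B_num BY $0;

-- Keep only the original fields from both sides.
M = FOREACH J GENERATE a1, a2, b1, b2;

DUMP M;
```

Joining on the positional reference $0 avoids relying on the name Pig gives the generated rank column.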
