Selecting random tuple from bag
Is it possible to (efficiently) select a random tuple from a bag in Pig? I could just take the first tuple of the bag (as it is unordered), but in my case I need a proper random selection. One (inefficient) solution is to count the number of tuples in the bag, pick a random number within that range, loop through the bag, and stop when the number of iterations matches the random number. Does anyone know of a faster/better way to do this?
How would you "loop through the bag"? –
Merriemerrielle
A = FOREACH myBag { --do stuff }; Actually, I haven't implemented this approach, so I'm not sure whether this solution would work as well –
Boron
That won't work; you can't keep track of the number of iterations. –
Merriemerrielle
You could use RANDOM(), ORDER, and LIMIT in a nested FOREACH statement to select the tuple with the smallest random number from each bag:
inpt = load 'group.txt' as (id:int, c1:bytearray, c2:bytearray);
groups = group inpt by id;
randoms = foreach groups {
    rnds = foreach inpt generate *, RANDOM() as rnd; -- assign random number to each row in the bag
    ordered_rnds = order rnds by rnd;
    one_tuple = limit ordered_rnds 1; -- select tuple with the smallest random number
    generate group as id, one_tuple;
};
dump randoms;
INPUT:
1 a r
1 a t
1 b r
1 b 4
1 e 4
1 h 4
1 k t
2 k k
2 j j
3 a r
3 e l
3 j l
4 a r
4 b t
4 b g
4 h b
4 j d
5 h k
OUTPUT:
(1,{(1,b,r,0.05172709255901231)})
(2,{(2,k,k,0.14351660053632986)})
(3,{(3,e,l,0.0854104195792681)})
(4,{(4,h,b,8.906013598960483E-4)})
(5,{(5,h,k,0.6219490873384448)})
If you run "dump randoms;" multiple times, you should get different results for each run.
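If you prefer plain tuples instead of single-tuple bags in the output, you could FLATTEN the nested result. A minimal follow-up sketch, using the id and one_tuple fields generated above (flat_randoms is just an illustrative alias; note that the rnd column is still carried along):
flat_randoms = foreach randoms generate id, FLATTEN(one_tuple); -- unwrap the single-tuple bag
dump flat_randoms;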
Writing a UDF might give you better performance, as the UDF can pick a random tuple from the bag in a single pass and you avoid the secondary sort on the random value within each bag.
I needed to do this myself and, surprisingly, found that a very simple answer seems to work to get about 10% of an alias A:
B = filter A by RANDOM() < 0.1;
This is available via the SAMPLE operator as well:
B = SAMPLE A 0.1;
–
Boron
As a side note, I've noticed some unusual behaviour when using RANDOM() followed by SPLIT (or a regular FILTER). In this case, partitioning your dataset A (into A1 and A2, for example) requires you to store your dataset first after applying the RANDOM() function. You can then reload the result and split it using SPLIT or FILTER in order to obtain a 'true' partition. Otherwise, you might end up with two sets such that |A1| + |A2| < |A|. –
Enroll
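A minimal sketch of the workaround described in this comment, assuming a tab-delimited input file 'data.txt', illustrative field names, and a 50/50 split:
A = load 'data.txt' as (id:int, value:chararray);
A_rnd = foreach A generate *, RANDOM() as rnd; -- materialize the random numbers before splitting
store A_rnd into 'data_with_rnd';
B = load 'data_with_rnd' as (id:int, value:chararray, rnd:double);
split B into A1 if rnd < 0.5, A2 if rnd >= 0.5; -- each row lands in exactly one partition
Storing and reloading fixes the random values before the split, so |A1| + |A2| = |A|.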