Selecting random tuple from bag
Is it possible to (efficiently) select a random tuple from a bag in Pig? I could just take the first tuple of the bag (as it is unordered), but in my case I need a proper random selection. One (inefficient) solution is to count the number of tuples in the bag, pick a random number within that range, loop through the bag, and stop when the number of iterations matches the random number. Does anyone know of a faster/better way to do this?
How would you "loop through the bag"? –
Merriemerrielle
A = FOREACH myBag { --do stuff }; Actually, I haven't implemented this approach, so I'm not sure whether this solution would work as well –
Boron
That won't work; you can't keep track of the number of iterations. –
Merriemerrielle
You could use RANDOM(), ORDER, and LIMIT in a nested FOREACH statement to select the tuple with the smallest random number from each bag:
inpt = load 'group.txt' as (id:int, c1:bytearray, c2:bytearray);
groups = group inpt by id;
randoms = foreach groups {
    rnds = foreach inpt generate *, RANDOM() as rnd; -- assign random number to each row in the bag
    ordered_rnds = order rnds by rnd;
    one_tuple = limit ordered_rnds 1; -- select tuple with the smallest random number
    generate group as id, one_tuple;
};
dump randoms;
INPUT:
1 a r
1 a t
1 b r
1 b 4
1 e 4
1 h 4
1 k t
2 k k
2 j j
3 a r
3 e l
3 j l
4 a r
4 b t
4 b g
4 h b
4 j d
5 h k
OUTPUT:
(1,{(1,b,r,0.05172709255901231)})
(2,{(2,k,k,0.14351660053632986)})
(3,{(3,e,l,0.0854104195792681)})
(4,{(4,h,b,8.906013598960483E-4)})
(5,{(5,h,k,0.6219490873384448)})
If you run "dump randoms;" multiple times, you should get different results for each run.
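If you prefer plain tuples instead of single-tuple bags in the output, you could FLATTEN the nested result. A minimal follow-up sketch, using the id and one_tuple fields generated above (flat_randoms is just an illustrative alias; note that the rnd column is still carried along):
flat_randoms = foreach randoms generate id, FLATTEN(one_tuple); -- unwrap the single-tuple bag
dump flat_randoms;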
Writing a UDF might give you better performance, as the UDF can pick a random tuple from the bag in a single pass and you avoid the secondary sort on the random value within each bag.
I needed to do this myself and, surprisingly, found that a very simple answer seems to work to get about 10% of an alias A:
B = filter A by RANDOM() < 0.1;
This is available via the SAMPLE operator as well:
B = SAMPLE A 0.1;
–
Boron
As a side note, I've noticed some unusual behaviour when using RANDOM() followed by SPLIT (or a regular FILTER). In this case, partitioning your dataset A (into A1 and A2, for example) requires you to store your dataset first after applying the RANDOM() function. You can then reload the result and split it using SPLIT or FILTER in order to obtain a 'true' partition. Otherwise, you might end up with two sets such that |A1| + |A2| < |A|. –
Enroll
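A minimal sketch of the workaround described in this comment, assuming a tab-delimited input file 'data.txt', illustrative field names, and a 50/50 split:
A = load 'data.txt' as (id:int, value:chararray);
A_rnd = foreach A generate *, RANDOM() as rnd; -- materialize the random numbers before splitting
store A_rnd into 'data_with_rnd';
B = load 'data_with_rnd' as (id:int, value:chararray, rnd:double);
split B into A1 if rnd < 0.5, A2 if rnd >= 0.5; -- each row lands in exactly one partition
Storing and reloading fixes the random values before the split, so |A1| + |A2| = |A|.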