pig - split, lack of default or if/else

Since there is no else or default clause in Pig's SPLIT operation, what would be the most elegant way to do the following? I'm not a big fan of having to copy and paste code.

SPLIT rawish_data
    INTO good_rawish_data IF (
    (uid > 0L) AND
    (value1 > 0) AND
    (value1 < 100) AND
    (value1 IS NOT NULL) AND
    (value2 > 0L) AND
    (value2 < 200L) AND
    (value3 >= 0) AND
    (value3 <= 300)),

    bad_rawish_data IF (NOT (
    (uid > 0L) AND
    (value1 > 0) AND
    (value1 < 100) AND
    (value1 IS NOT NULL) AND
    (value2 > 0L) AND
    (value2 < 200L) AND
    (value3 >= 0) AND
    (value3 <= 300)));

I would like to do something like

SPLIT data
    INTO good_data IF (
    (value > 0)),
    good_data_big_values IF (
    (value > 100)),
    bad_data DEFAULT;

Is anything like this possible in any way?

Grissel answered 20/9, 2013 at 9:51 Comment(0)

It is. According to the docs for SPLIT, you want to use OTHERWISE. For example:

SPLIT data
    INTO good_data IF (
    (value > 0)),
    good_data_big_values IF (
    (value > 100)),
    bad_data OTHERWISE;

So you almost got it. :)

NOTE: Each condition is evaluated independently, so SPLIT puts a single row into both good_data and good_data_big_values if, for example, value is 150. I don't know if this is what you want, but you should be aware of it regardless. This also means that bad_data will only contain rows where value is 0 or less.

Menzies answered 20/9, 2013 at 15:34 Comment(1)
Important note: bad_data will NOT contain rows where value is null! You need to specifically check for null or those rows will be dropped in this expression. - Chorography
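To make the null caveat concrete, here is a minimal sketch that spells out the last branch instead of relying on OTHERWISE, so that rows with a null value are routed explicitly. It assumes the relation data has a numeric field named value, as in the answer; whether OTHERWISE drops null rows may depend on the Pig version, so treat this as a defensive variant rather than a required fix.

```pig
-- Assumes `data` has a numeric field `value`, as in the answer above.
-- A comparison against null evaluates to null, so such rows match neither
-- (value > 0) nor (value > 100); the explicit IS NULL test catches them.
SPLIT data
    INTO good_data IF (value > 0),
         good_data_big_values IF (value > 100),
         bad_data IF ((value IS NULL) OR (value <= 0));
```

The trade-off is that the last condition has to be written by hand again, which is what the question wanted to avoid, but every row is then guaranteed to land in at least one output.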

You could write an IsGood() UDF in which all the conditions are checked. Then your Pig script is simply:

SPLIT data
    INTO good_data IF (IsGood(data)),
         good_data_big_values IF (IsGood(data) AND (value > 100)),
         bad_data IF (NOT IsGood(data));

Another option might be to use a macro.

Emotion answered 20/9, 2013 at 15:38 Comment(2)
If you use a recently checked out Pig from trunk, then using macros is an option; otherwise you may run into trouble. See: issues.apache.org/jira/browse/PIG-3239 - Brickey
Are you sure you can use IsGood(data) like that? Wouldn't you have to pass each (relevant) field like IsGood(value, uid, etc.)? - Menzies
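As a rough sketch of what passing the fields explicitly could look like for the question's data: the jar name, the class name com.example.pig.IsGood, and the argument order are all hypothetical, and IsGood is assumed to be a filter-style UDF that returns a boolean after checking the range conditions from the question.

```pig
-- Hypothetical jar and class; IsGood is assumed to return a boolean.
REGISTER myudfs.jar;
DEFINE IsGood com.example.pig.IsGood();

-- Pass the individual fields rather than the relation alias.
SPLIT rawish_data
    INTO good_rawish_data IF (IsGood(uid, value1, value2, value3)),
         bad_rawish_data IF (NOT IsGood(uid, value1, value2, value3));
```

This keeps the validation rules in one place (the UDF), at the cost of maintaining and deploying a jar alongside the script.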
