Apache Pig: strip namespace prefix (::) after group operation

Asked 11/6, 2012 at 22:42 Answered 26/11, 2012 at 20:11

A common pattern in my data processing is to group by some set of columns, apply a filter, then flatten again. For example:

my_data_grouped = group my_data by some_column;
my_data_grouped = filter my_data_grouped by <some expression>;
my_data = foreach my_data_grouped flatten(my_data);

The problem here is that if my_data starts with a schema like (c1, c2, c3) after this operation it will have a schema like (mydata::c1, mydata::c2, mydata::c3). Is there a way to easily strip off the "mydata::" prefix if the columns are unique?

I know I can do something like this:

my_data = foreach my_data generate c1 as c1, c2 as c2, c3 as c3;

However that gets awkward and hard to maintain for data sets with lots of columns and is impossible for data sets with variable columns.

Odor answered 11/6, 2012 at 22:42 Comment(0)

If all fields in a schema have the same set of prefixes (e.g. group1::id, group1::amount, etc) you can ignore the prefix when referencing specific fields (and just reference them as id, amount, etc)

Alternatively, if you're still looking to strip a schema of a single level of prefixing you can use a UDF like this:

public class RemoveGroupFromTupleSchema extends EvalFunc<Tuple> {

@Override
public Tuple exec(Tuple input) throws IOException {
    Tuple result = input;
    return result;
}


@Override
public Schema outputSchema(Schema input) throws FrontendException {
    if(input.size() != 1) {
        throw new RuntimeException("Expected input (tuple) but input does not have 1 field");
    }

    List<Schema.FieldSchema> inputSchema = input.getFields();
    List<Schema.FieldSchema> outputSchema = new ArrayList<Schema.FieldSchema>(inputSchema);
    for(int i = 0; i < inputSchema.size(); i++) {
        Schema.FieldSchema thisInputFieldSchema = inputSchema.get(i);
        String inputFieldName = thisInputFieldSchema.alias;
        Byte dataType = thisInputFieldSchema.type;

        String outputFieldName;
        int findLoc = inputFieldName.indexOf("::");
        if(findLoc == -1) {
            outputFieldName = inputFieldName;
        }
        else {
            outputFieldName = inputFieldName.substring(findLoc+2);
        }
        Schema.FieldSchema thisOutputFieldSchema = new Schema.FieldSchema(outputFieldName, dataType);
        outputSchema.set(i, thisOutputFieldSchema);
    }

    return new Schema(outputSchema);
}
}

Ere answered 26/11, 2012 at 20:11 Comment(1)

How to use this UDF? Thanks in advance. – Pouched 26/2, 2016 at 9:4

You can put the 'AS' statement on the same line as the 'foreach'.

i.e.

my_data_grouped = group my_data by some_column;
my_data_grouped = filter my_data_grouped by <some expression>;
my_data = FOREACH my_data_grouped FLATTEN(my_data) AS (c1, c2, c3);

However, this is just the same as doing it on 2 lines, and does not alleviate your issue for 'data sets with variable columns'.

September answered 19/6, 2012 at 15:31 Comment(0)

Recommended topics

Hot tags