How to: Python UDF dictionary return schema in Pig

What is the output schema for returning a dictionary from a Python UDF when using Apache Pig?

I have a dictionary of dictionaries, something like this:

d = {'x': {'a': 1, 'b': 2, 'c': 3}, 'y': {'d': 1, 'e': 3, 'f': 9}}

and my output schema looks like

@outputSchema("m:map[im:map[X:float,Y:float]]") 

(Square brackets, because Pig uses [] for maps, which is what this dictionary is converted to.)

Diachronic answered 12/11, 2012 at 19:55

If you are using the standard Jython UDFs, and not another distribution such as the streaming_python provided by Mortar Data, all you need is:

@outputSchema('m:map[]') 

The keys will be the same ones you set in Python. If your dict contains other dictionaries, you don't need to worry about it; Pig will understand them and use the following syntax:

([first#{third=inner_dict},first#outter_dict])
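A minimal sketch of such a UDF might look like the following. The function name nested_counts and its argument are hypothetical. The fallback definition of outputSchema is only an assumption so that the file can also be imported in plain Python; when Pig loads a Jython script it supplies the decorator itself.

```python
# Pig's Jython engine normally provides outputSchema. This stand-in is an
# assumption (not Pig's own code) so the file also imports outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema  # keep the schema string on the function
            return func
        return wrap


@outputSchema('m:map[]')
def nested_counts(key):
    # Hypothetical UDF: returns a dict of dicts. With 'map[]' the values
    # are left untyped, so Pig treats them as bytearray.
    return {
        key: {'a': 1, 'b': 2, 'c': 3},
        'y': {'d': 1, 'e': 3, 'f': 9},
    }
```

In a Pig script, a file like this (say udfs.py, a made-up name) would typically be registered with REGISTER 'udfs.py' USING jython AS udfs; and then called as udfs.nested_counts(some_field).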

Passing a dict back to Pig from a Jython UDF has one big disadvantage: you can set only one data type for all the values in the dict. If you don't set a data type, Pig uses bytearray, which can be a problem when working with dates or complex structures. For example, to declare every value as chararray:

@outputSchema('m:map[chararray]')
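Since the declared type applies to every value, one possible workaround is to convert all values to that single type before returning the dict. The sketch below assumes a hypothetical UDF named describe with two scalar arguments; the fallback outputSchema is again only a stand-in for running outside Pig.

```python
# Stand-in for Pig's injected decorator (an assumption, for plain Python only).
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema('m:map[chararray]')
def describe(name, count):
    # Hypothetical UDF: every value is converted to a string so that it
    # matches the single chararray value type declared in the schema.
    return {'name': str(name), 'count': str(count)}
```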

Tuples and Bags:

When you return a tuple or a bag to Pig from a Jython UDF, remember that Python lists convert to bags and Python tuples convert to tuples. For example:

Lists:

@outputSchema('m:bag{chararray}')

Remember that Pig bags are filled with tuples, so if you want a well-defined structure for your bag, declare a tuple within the bag and set the data type of every field you pass. For example:

@outputSchema('map_reduce:bag{t:(key:chararray,value:int,start_date:datetime,end_date:datetime)}')
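As a rough illustration of the list-to-bag conversion, here is a sketch of a UDF that returns a list of tuples. The name to_pairs and its two-field schema are hypothetical and simpler than the schema above (the datetime fields are left out), and the fallback outputSchema is only an assumption for running outside Pig.

```python
# Stand-in for Pig's injected decorator (an assumption, for plain Python only).
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema('pairs:bag{t:(key:chararray,value:int)}')
def to_pairs(text):
    # Hypothetical UDF: count the words in a string and return a list of
    # (word, count) tuples. The list becomes the bag, each tuple a row in it.
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return [(word, counts[word]) for word in sorted(counts)]
```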

Finally, tuples should be fairly intuitive; they are the easiest structure to use with Jython. Within a tuple you can set as many fields and as many levels as you want, as long as you follow the examples above. You could declare a tuple within a tuple, a tuple that contains a bag and other values, and so on.
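To make the nesting concrete, here is a sketch of a tuple that holds a string, an inner tuple and a bag. The UDF name summarize, its comma-separated input and the field names are all hypothetical, and the fallback outputSchema is only an assumption for running outside Pig.

```python
# Stand-in for Pig's injected decorator (an assumption, for plain Python only).
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema('summary:(name:chararray,bounds:(low:int,high:int),items:bag{t:(item:int)})')
def summarize(name, csv):
    # Hypothetical UDF: parse a comma-separated string of integers.
    numbers = [int(part) for part in csv.split(',')]
    bounds = (min(numbers), max(numbers))   # inner tuple
    items = [(n,) for n in numbers]         # list of 1-tuples, i.e. a bag
    return (name, bounds, items)            # outer tuple
```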

I strongly recommend using Java UDFs for complex operations or complex data types such as JSON structures, arrays and lists. The learning curve is a little steeper, but once you are past it, your development will be much faster and your program's throughput higher.

Shamus answered 3/12, 2014 at 14:45