Pig: is it possible to use pytz or dateutil in Python UDFs?
I am using datetime in some Python UDFs that I call from my Pig script. So far so good. I use Pig 0.12.0 on Cloudera 5.5.

However, I also need the pytz or dateutil packages, and they don't seem to be part of a vanilla Python install.

Can I use them in my Pig UDFs in some way? If so, how? I think dateutil is installed on my nodes (I am not an admin, so how can I actually check that this is the case?), but when I type:

import sys
#I append the path to dateutil on my local windows machine. Is that correct?
sys.path.append('C:/Users/me/AppData/Local/Continuum/Anaconda2/lib/site-packages')

from dateutil import tz

in my udfs.py script, I get:

2016-08-30 09:56:06,572 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1121: Python Error. Traceback (most recent call last):
  File "udfs.py", line 23, in <module>
    from dateutil import tz
ImportError: No module named dateutil

when I run my pig script.

All my other python udfs (using datetime for instance) work just fine. Any idea how to fix that?

Many thanks!

UPDATE

After playing a bit with the Python path, I am now able to

import dateutil

(at least Pig does not crash). But if I try:

from dateutil import tz

I get an error.

  from dateutil import tz 
  File "/opt/python/lib/python2.7/site-packages/dateutil/tz.py", line 16, in <module>
    from six import string_types, PY3
  File "/opt/python/lib/python2.7/site-packages/six.py", line 604, in <module>
    viewkeys = operator.methodcaller("viewkeys")
AttributeError: type object 'org.python.modules.operator' has no attribute 'methodcaller'

How to overcome that? I use tz in the following manner

to_zone = dateutil.tz.gettz('US/Eastern')
from_zone = dateutil.tz.gettz('UTC')

and then I change the timezone of my timestamps. Can I just import dateutil to do that? What is the proper syntax?

UPDATE 2

Following yakuza's suggestion, I am able to

import sys
sys.path.append('/opt/python/lib/python2.7/site-packages')
sys.path.append('/opt/python/lib/python2.7/site-packages/pytz/zoneinfo')

import pytz

but now I get an error again

Caused by: Traceback (most recent call last):
  File "udfs.py", line 158, in to_date_local
  File "__pyclasspath__/pytz/__init__.py", line 180, in timezone
pytz.exceptions.UnknownTimeZoneError: 'America/New_York'

when I define

to_zone = pytz.timezone('America/New_York')
from_zone = pytz.timezone('UTC')

Found some hints here UnknownTimezoneError Exception Raised with Python Application Compiled with Py2Exe

What to do? Awww, I just want to convert timezones in Pig :(

Handcrafted answered 26/8, 2016 at 22:52 Comment(3)
Regarding your second update, git.launchpad.net/pytz/tree/src/pytz/__init__.py#n180 suggests that 'America/New_York' is not in all_timezones_set. From the source code it seems that this exception is thrown either if the timezone name is not composed of ASCII characters or if it is not in the list of known timezones. Verify that your installation is not corrupted and that this entry is actually present in the pytz/__init__.py file. - Darnelldarner
I am trying right now with US/Eastern. That should work, right? - Filippa
Well, I don't believe your issue lies in which timezone you pick. Both of them should be available out of the box in pytz, so yes, that should work. - Darnelldarner

Well, as you probably know, Python UDFs are not executed by the standard Python interpreter but by Jython, which is distributed with Pig. By default, Pig 0.12.0 should ship Jython 2.5.3. Unfortunately, the six package, which dateutil requires, only supports Python 2.6 and later. pytz, however, does not seem to have such a dependency and should support Python 2.4 and later.

So to achieve your goal, distribute a pytz package built for Python 2.5 to all your nodes, and add its path to sys.path in your Pig UDF. If you repeat the steps you did for dateutil, everything should work as you expect. We use the very same approach with pygeoip and it works like a charm.
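To confirm which interpreter actually runs a UDF, a small check like the one below can help. It is only a sketch: the function name describe_interpreter is made up for this example, and the 2.6 threshold simply restates the six requirement mentioned above. It relies on Jython reporting a sys.platform value that starts with 'java'.

```python
import sys


def describe_interpreter():
    """Report which Python implementation and version is running this code."""
    version = tuple(sys.version_info[:3])
    return {
        # Jython sets sys.platform to a string beginning with 'java'.
        'is_jython': sys.platform.startswith('java'),
        'version': version,
        # Per the answer above, six (needed by dateutil) requires Python 2.6+.
        'meets_six_minimum': version >= (2, 6),
    }
```

Inside a UDF file, one way to surface the result is to raise RuntimeError(str(describe_interpreter())) temporarily, so that the values show up in the Pig error output.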

How does it work

When you run a Pig script that references a Python UDF (more precisely, a Jython UDF), your script is compiled into a map/reduce job. All REGISTERed files are included in the JAR file and distributed to the nodes where the code is actually executed. On each of those nodes, the Jython interpreter is started from Java code and runs your UDF, so all Python imports are resolved locally on that node.

Imports from the standard library are taken from the Jython implementation, but all other packages have to be installed separately, as there is no pip for it. So to make external packages available to a Python UDF, you have to install them manually, either with the pip of another Python installation or from source, and remember to download a package compatible with Python 2.5! Then, in every single UDF file, you have to append to sys.path the site-packages directory where you installed the packages (it's important to use the same directory on each node). For example:

import sys
sys.path.append('/path/to/site-packages')
# Imports of non-stdlib packages
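If the packages may live in different directories on different nodes, a slightly more defensive variant is to append only the directories that exist on the node running the code. This is a minimal sketch: the helper name add_existing_dirs is made up for this example, and the two candidate paths are just the ones mentioned in this thread, so replace them with the real locations on your cluster.

```python
import os
import sys

# Example candidates taken from this thread; adjust to your own nodes.
CANDIDATE_DIRS = [
    '/opt/python/lib/python2.7/site-packages',
    '/usr/lib/python2.6/site-packages',
]


def add_existing_dirs(candidates, path_list=None):
    """Append each candidate that exists on this node and is not yet listed."""
    if path_list is None:
        path_list = sys.path
    added = []
    for directory in candidates:
        if os.path.isdir(directory) and directory not in path_list:
            path_list.append(directory)
            added.append(directory)
    return added


# Run this before importing any non-stdlib package.
add_existing_dirs(CANDIDATE_DIRS)
```

The function returns the directories it actually added, which can be logged to see which location was picked up on a given node.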

Proof of concept

Let's assume we have the following files:

/opt/pytz_test/test_pytz.pig:

REGISTER '/opt/pytz_test/test_pytz_udf.py' using jython as test;

A = LOAD '/opt/pytz_test/test_pytz_data.csv' AS (timestamp:int);
B = FOREACH A GENERATE
    test.to_date_local(timestamp);

STORE B INTO '/tmp/test_pytz_output.csv' using PigStorage(',');

/opt/pytz_test/test_pytz_udf.py:

from datetime import datetime
import sys

sys.path.append('/usr/lib/python2.6/site-packages/')

import pytz

@outputSchema('date:chararray')
def to_date_local(unix_timestamp):
    """
    converts unix timestamp to a rounded date
    """
    to_zone = pytz.timezone('America/New_York')
    from_zone = pytz.timezone('UTC')

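    try:
        # Parenthesize the chained calls so the expression can span lines.
        as_datetime = (datetime.utcfromtimestamp(unix_timestamp)
                       .replace(tzinfo=from_zone)
                       .astimezone(to_zone)
                       .date().strftime('%Y-%m-%d'))
    except Exception:
        # Fall back to the raw input if the conversion fails.
        as_datetime = unix_timestamp
    return as_datetime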

/opt/pytz_test/test_pytz_data.csv:

1294778181
1294778182
1294778183
1294778184

Now let's install pytz on our node. It has to be installed with a Python version whose pytz build is compatible with Python 2.5 (that is, 2.5-2.7); in my case I'll use Python 2.6:

sudo pip2.6 install pytz

Please make sure that the file /opt/pytz_test/test_pytz_udf.py appends to sys.path the site-packages directory where pytz is installed.

Now once we run Pig with our test script:

pig -x local /opt/pytz_test/test_pytz.pig

We should be able to read output from our job, which should list:

2011-01-11
2011-01-11
2011-01-11
2011-01-11
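The expected dates can be double-checked without pytz. The sketch below assumes that New York is on Eastern Standard Time (UTC-5) for these January timestamps, so it uses a fixed offset instead of a real timezone database. The function name is made up for this check and it is not a replacement for the UDF.

```python
from datetime import datetime, timedelta

# Assumption: fixed UTC-5 offset (EST), valid for the January timestamps above.
EST_OFFSET = timedelta(hours=-5)


def to_new_york_winter_date(unix_timestamp):
    """Convert a Unix timestamp to a New York calendar date during EST."""
    utc_time = datetime(1970, 1, 1) + timedelta(seconds=unix_timestamp)
    return (utc_time + EST_OFFSET).strftime('%Y-%m-%d')


print(to_new_york_winter_date(1294778181))  # 2011-01-11
```

1294778181 is 2011-01-11 20:36:21 UTC, which is 15:36:21 in New York, so the date stays the same.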
Darnelldarner answered 31/8, 2016 at 22:10 Comment(19)
Thanks, but now I get Caused by: Traceback (most recent call last): File "udfs.py", line 158, in to_date_local File "__pyclasspath__/pytz/__init__.py", line 180, in timezone pytz.exceptions.UnknownTimeZoneError: 'America/New_York' - Filippa
After doing some research, it appears this can be due to the fact that the timezones are stored in a folder other than sys.path.append('C:/Users/me/AppData/Local/Continuum/Anaconda2/lib/site-packages'). Any ideas what to do? - Filippa
Well, I guess the solution would be to place this information there (#21717911). Looking at the pytz source, there is a method called open_resource with this comment: "Open a resource from the zoneinfo subdir for reading. Uses the pkg_resources module if available and no standard file found at the calculated location." So the best solution would be to place the database in one of these locations. - Darnelldarner
What do you mean? What should I do with open_resource? - Filippa
Sorry, pressed "enter" too quickly. Edited above. - Darnelldarner
Sorry for my noobiness, but can you tell me how I could do that? What would be the code to include in my UDFs? - Filippa
Yes, sure. If you do a clean install of pytz, it creates a directory structure in your local dist-packages. In my case it would be /usr/local/lib/python2.7/dist-packages/pytz. Inside you will find a folder named zoneinfo. What you need to do is make sure that this folder is distributed to all nodes, to the place where pytz is installed, just as it would be after a proper installation. - Darnelldarner
Yes, but this is the problem: all the nodes have a proper Anaconda distribution, so they all have pytz already... - Filippa
Let us continue this discussion in chat. - Darnelldarner
Thanks again yakuza for your kind help. Let me try that. But what exactly do you mean by "Please make sure, that file /opt/pytz_test/test_pytz_udf.py adds to sys.path reference to site-packages where pytz is installed."? - Filippa
Oh OK, I get it, just adding the path where pytz is installed. But that was already the case... :( - Filippa
So let's dig a bit deeper. To make sure everything in your file system is set up properly, log somewhere in your UDF the result of this code: pytz.resource_exists('America/New_York'). The best way would be to dump it to a file or raise a RuntimeError with a proper message. In fact, you could also make use of the following information: str(pytz.all_timezones) - Darnelldarner
Hi Yakuza. I think we're close. I have pytz installed on all my nodes. The question is, which path does the path in my UDFs refer to? The one on the main computer or the ones on my nodes? Strangely enough, the line with UTC does work, while the line with America/New_York or US/Eastern causes pytz.exceptions.UnknownTimeZoneError: 'US/Eastern' - Filippa
As you can find in the pytz source, git.launchpad.net/pytz/tree/src/pytz/__init__.py#n89, if you provide the time zone 'US/Eastern' it looks for the file __file__/zoneinfo/US/Eastern, where __file__ is the path to your pytz/__init__.py file. So it should definitely look for the ones on your nodes, precisely on the node where your Pig code is being executed. - Darnelldarner
Oh, and UTC is working because it's a special case and is not loaded from the file system: git.launchpad.net/pytz/tree/src/pytz/__init__.py#n244 - Darnelldarner
Great! I will try some things. I'm pretty sure this question is gonna have a lot of views. So assuming that each package is installed in a different directory (on the nodes, on the master), then I should add all of them in the udfs.py file. Correct? - Filippa
Correct, if you have pytz installed in different directories on different nodes, then you have to include all possible locations on all nodes, as you can't tell on which one the code will be executed. - Darnelldarner
Hi @Yakuza, I guess it's time for the very last question ;-) I have my pytz package installed, and I see the egg file. Should I add the path to the egg? - Filippa
Hey @Noobie :) no, it's not necessary, you simply have to include the path to all site-packages/dist-packages where pytz is installed on all nodes. The path where pytz looks for its files is computed relative to the directory in which the pytz package resides. - Darnelldarner
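The file-system check discussed in these comments can be sketched with the standard library alone. It assumes, as described above, that pytz looks for each zone under a zoneinfo folder next to its __init__.py. The helper names zone_file_path and missing_zone_files are made up for this example.

```python
import os


def zone_file_path(pytz_dir, zone_name):
    """Build the path where a zone file is expected inside a pytz directory."""
    return os.path.join(pytz_dir, 'zoneinfo', *zone_name.split('/'))


def missing_zone_files(pytz_dir, zone_names):
    """Return the zone names whose data file is absent under pytz_dir."""
    return [name for name in zone_names
            if not os.path.isfile(zone_file_path(pytz_dir, name))]
```

Inside a UDF, after importing pytz, this could be called as missing_zone_files(os.path.dirname(pytz.__file__), ['America/New_York', 'US/Eastern']), with the result raised in a RuntimeError so it appears in the Pig output. Note that the traceback in the question shows pytz loaded from __pyclasspath__, in which case that directory may not be a real folder on disk and every zone would be reported as missing.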

From the answer to a different but related question, it seems that you should be able to use resources as long as they are available on each of the nodes.

I think you can then add the path as described in this answer regarding jython, and load the modules as usual.

Append the location to the sys.path in the Python script:

import sys
sys.path.append('/usr/local/lib/python2.7/dist-packages')
import happybase
Doily answered 29/8, 2016 at 12:11 Comment(3)
Thanks, but I still get org.apache.pig.backend.executionengine.ExecException: ERROR 1121: Python Error. Traceback (most recent call last): File "udfs.py", line 18, in <module> from dateutil import tz ImportError: No module named dateutil - Filippa
And python-dateutil has been installed on all my nodes. - Filippa
@Noobie Are you able to do the import when running a Python script manually on the slave node? Perhaps you need to append the package location to the path; I have edited the answer. - Doily
