Python Dependency Management on EMR

2

6

I'm sending code to Amazon's EMR via the mrjob/boto modules. I've got some external Python dependencies (e.g. numpy, boto) and currently have to download the source of the Python packages and send them over as a tarball in the "python_archives" field of the mrjob.conf file.

This makes dependency management messier than I would like, and I'm wondering if I can somehow use the same requirements.txt file I use for my virtualenv setup to bootstrap the EMR instance with my dependencies. Is it possible to set up virtualenvs on EMR instances and do something like:

pip install -r requirements.txt

as I would locally?

Occlusive answered 9/7, 2013 at 21:24 Comment(0)
3

One way to accomplish this is using a bootstrap action. You can use these to run shell scripts.

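If you have a setup Python file that does something like:

# Generate pip.sh, a shell script that installs each requirement with pip.
with open("requirements.txt") as requirements, open("pip.sh", "w") as shell_script:
    shell_script.write("sudo apt-get install -y python-pip\n")
    for line in requirements:
        package = line.strip()
        # Skip blank lines and comments; quote so ">=" is not a shell redirect.
        if package and not package.startswith("#"):
            shell_script.write("sudo pip install -I '" + package + "'\n")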

Then you can just run the generated pip.sh as the bootstrap action, without needing to upload your requirements.txt.

Garbage answered 15/7, 2013 at 22:45 Comment(0)
0

So, if you're using mrjob, I've had some success by just putting the pip calls straight into my .mrjob.conf file as a bootstrap action. It's not as elegant as using a requirements.txt file, since it installs the same modules for all your jobs. For example, my conf file looks like:

runners:
  emr:
    aws_access_key_id: xx
    aws_secret_access_key: xx
    ec2_key_pair: xx
    ec2_key_pair_file: xx
    ssh_tunnel_to_job_tracker: true
    bootstrap_cmds:
      - sudo apt-get install -y python-pip
      - sudo pip install pgnparser
      - sudo pip install boto

and that will load the pgnparser and boto modules for me to use in my mrjob scripts.
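If you'd rather not keep that list in sync by hand, the bootstrap_cmds entries could be generated from the same requirements.txt. This is only a rough sketch: the function names bootstrap_cmds_from_requirements and as_conf_fragment are made up for illustration, it assumes a simple requirements file (one package per line, no "-r" includes or pip options), and it uses shlex.quote from Python 3 so that specifiers like "numpy>=1.7" aren't treated as a shell redirect.

```python
import shlex


def bootstrap_cmds_from_requirements(requirements_text):
    """Return one 'sudo pip install ...' command per requirement line."""
    # Install pip first, as in the conf file above.
    cmds = ["sudo apt-get install -y python-pip"]
    for line in requirements_text.splitlines():
        requirement = line.strip()
        # Skip blank lines and comment lines.
        if not requirement or requirement.startswith("#"):
            continue
        # Quote the requirement so characters like '>' stay literal in the shell.
        cmds.append("sudo pip install " + shlex.quote(requirement))
    return cmds


def as_conf_fragment(cmds, indent="      "):
    """Format commands as YAML list items to paste under bootstrap_cmds."""
    return "\n".join(indent + "- " + cmd for cmd in cmds)


if __name__ == "__main__":
    # Hypothetical requirements content, used only for this demo.
    sample = "pgnparser\n\n# AWS client\nboto\n"
    print("    bootstrap_cmds:")
    print(as_conf_fragment(bootstrap_cmds_from_requirements(sample)))
```

You'd paste the printed lines into the emr section of your conf file in place of the hand-written list.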

Caesaria answered 27/12, 2013 at 3:35 Comment(0)
