How can I use an external python library in AWS Glue?
First Stack Overflow question here. Hope I do this correctly:

I need to use an external Python library in AWS Glue. "openpyxl" is the name of the library.

I followed these directions: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html

However, after saving my zip file in the correct S3 location and pointing my Glue job to that location, I'm not sure what to actually write in the script.

I tried the typical import openpyxl, but that just returns the following error:

ImportError: No module named openpyxl

Obviously I don't know what to do here. I'm also relatively new to programming, so I'm not sure if this is a noob question or what. Thanks in advance!

Avraham answered 2/10, 2019 at 16:55 Comment(1)
Is it a Spark job or a Python shell job? – Pitcher
It depends on whether the job is Spark or Python Shell. For Spark, you just need to zip the library and then, when you point the job to the library's S3 path, the job will import it. You just need to make sure that the zip contains an __init__.py file.

For example, for the library you are trying to import, you can download openpyxl-3.0.0.tar.gz from https://pypi.org/project/openpyxl/#files, zip the openpyxl folder it contains, and store the zip in S3.
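As a sketch, that packaging step might look like this from a shell (the bucket name is illustrative, and this assumes pip and the AWS CLI are installed and configured):

# Fetch and unpack the source distribution from PyPI
pip download openpyxl==3.0.0 --no-deps --no-binary :all:
tar -xzf openpyxl-3.0.0.tar.gz

# Zip the package folder itself, so openpyxl/__init__.py sits inside the zip
cd openpyxl-3.0.0
zip -r openpyxl.zip openpyxl/

# Upload to S3 and point the Glue job's Python library path at this object
aws s3 cp openpyxl.zip s3://my-glue-libs/openpyxl.zip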


On the other hand, if it is a Python Shell job, a zip file will not work; you will need to create an egg file from the library. If you are using version openpyxl-3.0.0, you can download it from that same website, extract everything, and run the command python setup.py bdist_egg (use python3 instead of python if you are on Python 3).

This will generate an egg file inside the dist folder (which is also generated). You just need to put that egg file in S3 and point the Glue job's Python library path to it.
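As a sketch, building and uploading the egg might look like this (the bucket name is illustrative, and the exact egg filename generated will vary):

# Build the egg from the extracted source
cd openpyxl-3.0.0
python setup.py bdist_egg    # or: python3 setup.py bdist_egg

# Upload the generated egg and point the Glue job's Python library path at it
aws s3 cp dist/*.egg s3://my-glue-libs/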

If you already have the library but for some reason don't have its setup.py, you must create one in order to run the command that generates the egg file. Please refer to http://www.blog.pythonlibrary.org/2012/07/12/python-101-easy_install-or-how-to-create-eggs/ for an example; a minimal sketch also follows.
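For reference, a minimal setup.py sketch for packaging an existing library folder might look like this (the name and version here are illustrative):

# Minimal setup.py so that python setup.py bdist_egg can run
from setuptools import setup, find_packages

setup(
    name="openpyxl",
    version="3.0.0",
    packages=find_packages(),
)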

Johnjohna answered 3/10, 2019 at 9:40 Comment(3)
For Python shell jobs, there is no need to download the library and bundle it in an egg file. You can use install_requires=['openpyxl==3.0.0'] in setup.py and it will be downloaded and installed in Glue during execution (see the sketch after these comments). – Pitcher
As Sandeep said, this build process is only needed for custom user libraries. Right now wheels work fine, so no eggs are needed. I am still trying to understand the rationale for requiring different formats for Spark vs. Shell, though. Would it make things too easy? – Role
This might be a silly question, but can you also add .json files to the external library zip and access them through the Glue script? – Cm
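As a sketch of the install_requires approach Pitcher mentions above (the wrapper package name is hypothetical; Glue then fetches and installs openpyxl at run time rather than having it bundled in the egg):

# setup.py for a thin wrapper package; Glue installs the listed
# dependencies during execution of the Python shell job
from setuptools import setup, find_packages

setup(
    name="my_glue_deps",  # hypothetical package name
    version="0.1",
    packages=find_packages(),
    install_requires=["openpyxl==3.0.0"],
)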
You can now (as of Glue version 2) directly add external libraries using the --additional-python-modules parameter.

For example, to update or add the scikit-learn module, use the following key/value pair:

"--additional-python-modules", "scikit-learn==0.21.3".

More details can be found in the docs.

Abdomen answered 19/10, 2020 at 11:24 Comment(2)
It is not working for me; it still gives a "no module" error. Any help? – Balikpapan
This is valid for Glue Spark only. – Timeserver
I spent several hours on this over the last few days and could not find a way to make it work.

1. Using a Spark Notebook

In a notebook one can specify something like this in the first cell:

%additional_python_modules ["pydantic==1.10.12"]

But it doesn't work. The notebook shows promising logs such as this one:

Applying the following default arguments:
--glue_kernel_version 0.38.1
--enable-glue-datacatalog true
--additional_python_modules ["pydantic==1.10.12"]

But it actually doesn't import the library (you get a ModuleNotFoundError when trying to import it).
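For what it's worth, the AWS documentation shows this magic taking a comma-separated list rather than a JSON array, so the following form may be worth trying (untested here):

%additional_python_modules pydantic==1.10.12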

Other configurations in a notebook, such as writing the following code in a cell:

%extra_py_files s3://some-bucket/scripts/some_library.zip

also do not work, despite showing promising logs such as this one:

Extra py files to be included:
s3://some-bucket/scripts/pymongo/pymongo.zip

2. Using Spark scripts

I also tried creating a Spark script. I uploaded a zip file of the package (containing an __init__.py file) and configured the path in the job details (as shown in the image below), but it didn't work.

[Image: configuring the job's S3 library path]

Again, I saw AWS logs that seemed to indicate that everything was working well:

23/10/11 20:04:02 INFO SparkContext: 
Added file pymongo.zip at spark://172.35.246.64:43003/files/pymongo.zip with timestamp 1697054641877
Novelia answered 11/10, 2023 at 21:2 Comment(0)
I struggled with this for a couple of hours: being on Glue 4.0, my solution was just to change the version back to Glue 3.0 with %glue_version 3.0.

From there, use the magic %additional_python_modules library==version. Glue didn't read my library when I tried the magic originally on 4.0, but it works on Glue 3.0.
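As a sketch, the first notebook cell would then look like this (the module and versions are illustrative):

%glue_version 3.0
%additional_python_modules openpyxl==3.0.0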

Brezhnev answered 30/4 at 20:52 Comment(2)
Note: I created my account to post this because I have a disdain for AWS and their poor documentation. – Brezhnev
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. – Bolshevist
You may use the following boilerplate code to use extra files as well as external libraries: https://github.com/fatangare/aws-python-shell-deploy

Pitcher answered 4/11, 2019 at 8:11 Comment(0)