How to use extra files for an AWS Glue job
I have an ETL job written in Python, which consists of multiple scripts with the following directory structure:

my_etl_job
 |
 |--services
 |  |
 |  |-- __init__.py
 |  |-- dynamoDB_service.py
 |
 |-- __init__.py
 |-- main.py
 |-- logger.py

main.py is the entrypoint script that imports the other scripts from the above directories. The code runs perfectly fine on a dev endpoint, after uploading it to the ETL cluster created by the dev endpoint. Since I now want to run it in production, I want to create a proper Glue job for it. But when I compress the whole my_etl_job directory into .zip format, upload it to the artifacts S3 bucket, and specify the .zip file location as the script location as follows

s3://<bucket_name>/etl_jobs/my_etl_job.zip

This is the "code" I see on the Glue job UI dashboard (evidently the raw bytes of the zip archive):

PK
    ���P__init__.pyUX�'�^"�^A��)PK#7�P  logger.pyUX��^1��^A��)]�Mk�0����a�&v+���A�B���`x����q��} ...AND ALLOT MORE...

It seems the Glue job doesn't accept the .zip format? If so, what compression format should I use instead?

UPDATE: I found that the Glue job has an option for taking in extra files, Referenced files path, where I provided a comma-separated list of the paths of all the above files and changed the script location to refer only to the main.py file path. But that also didn't work. The Glue job throws the error no module named logger (and I defined this module inside the logger.py file).

Americana answered 14/4, 2020 at 21:50 Comment(1)
Have you tried this aws.amazon.com/premiumsupport/knowledge-center/… ? – Bryonbryony

You'll have to pass the zip file as an extra Python library, or build a wheel package for the code and upload the zip or wheel to S3, then provide that same path in the extra Python library option.

Note: have your main function written in the Glue console itself, referencing the required functions from the zipped/wheel dependency; your script location should never be a zip file.

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html
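As a sketch of the resulting configuration (bucket, role, and job names below are hypothetical placeholders): the script location points at a plain main.py, and the zipped helpers ride along via the --extra-py-files default argument. With boto3 the dict would be submitted as `boto3.client("glue").create_job(**spec)`.

```python
def build_job_spec(bucket: str) -> dict:
    """Assemble create_job arguments for a Spark Glue job.

    Names here are illustrative, not taken from the question.
    """
    return {
        "Name": "my_etl_job",
        "Role": "MyGlueServiceRole",  # hypothetical IAM role
        "Command": {
            "Name": "glueetl",
            # A .py entrypoint, never a .zip:
            "ScriptLocation": f"s3://{bucket}/etl_jobs/main.py",
        },
        "DefaultArguments": {
            # The zipped dependencies go here instead:
            "--extra-py-files": f"s3://{bucket}/etl_jobs/my_etl_job.zip",
        },
    }

spec = build_job_spec("my-artifacts-bucket")
print(spec["Command"]["ScriptLocation"])
```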

Therontheropod answered 15/4, 2020 at 15:22 Comment(1)
That's correct. I provided script_location: s3://<bucket_name>/etl_jobs/main.py and --extra-py-files s3://<bucket_name>/etl_jobs/my_etl_job.zip, and it worked. – Americana

I'm using Glue v2.0 with the Spark job type (rather than Python shell) and had a similar issue.

In addition to the previous answers regarding zip files, which note that:

  • The main.py should not be zipped.
  • The .zip archive corelib.zip (or services.zip) should contain the corelib (or services) folder and its contents.

I followed this and was still getting ImportError: No module named errors when trying to import my module.

After adding the following snippet to my Glue Job script:

import sys
import os

print(f"os.getcwd()={os.getcwd()}")
print(f"os.listdir('.')={os.listdir('.')}")

print(f"sys.path={sys.path}")

I could see that the current working directory contained my zip file.

But sys.path did not include the current working directory.

So Python was unable to import from my zip file, resulting in an ImportError: No module named error.

To resolve the import issue, I simply added the following code to my Glue Job script.

import sys
sys.path.insert(0, "utils.zip")

import utils

For reference: The contents of my utils.zip

Archive:  utils.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
       0  Defl:N        5   0% 01-01-2049 00:00 00000000  __init__.py
    6603  Defl:N     1676  75% 01-01-2049 00:00 f4551ccb  utils.py
--------          -------  ---                            -------
    6603             1681  75%                            2 files

(Note that __init__.py must be present for a module import to work)

My local project folder structure

my_job_stuff
 |-- utils
 |   |-- __init__.py
 |   |-- utils.py
 |-- main.py
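The whole fix can be reproduced locally. The sketch below (the file names and the greet helper are made up for illustration) builds a utils.zip like the one listed above, puts the archive itself on sys.path, and imports from it:

```python
import os
import sys
import tempfile
import zipfile

# Build a utils.zip whose root contains __init__.py and utils.py,
# mirroring the archive listing in the answer.
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "utils.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("__init__.py", "")
    zf.writestr("utils.py", "def greet():\n    return 'hello from utils'\n")

# Without this insert, Python cannot see modules inside the archive.
sys.path.insert(0, zip_path)
import utils

print(utils.greet())
```

Python's built-in zipimport machinery is what makes `import utils` work once the archive path is on sys.path.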
Mcdade answered 11/3, 2022 at 9:54 Comment(1)
My current observations with Glue 2.0 and 3.0 -- there doesn't even have to be an __init__.py in the .zip root (only in zipped packages, i.e. directories with modules), and I don't need to do sys.path.insert(...), but I have to pass the zip as --extra-py-files, not --additional-python-modules. – Emanative
  1. Your main job should not be zipped. It should be a .py file itself; in this case that would be your main.py. This should not be part of the zip file.
  2. Any additional library files you refer to in your code can be zipped or made into a wheel file and referenced via the extra files option. Your folder structure can be slightly modified to hold all these extra py files you refer to in main; it would be better off looking like the below. If you have more services, consider breaking it down even further, but below is a simple example.
my_etl_job
 |
 |--corelib
 |  |
 |  |--__init__.py
 |  |-- services
 |      |
 |      | -- dynamoDB_service.py
 |      | -- logger.py
 |
 |-- main.py

You can then import your dynamoDB_service module in main.py as corelib.services.dynamoDB_service. When you prepare your library, just go to the folder above corelib and zip up the folder like below:

zip -r corelib.zip corelib/

You can then add corelib.zip as your extra files in Glue. (You can prepare a wheel file too; it's your preference.)
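The layout and import above can be simulated end to end on a local machine; putting the zip on sys.path stands in for what the extra files option does on the cluster. Note this sketch adds an __init__.py under services (as the earlier answer observes, __init__.py is needed for package imports), and the TABLE constant is a stand-in for the real service code:

```python
import os
import sys
import tempfile
import zipfile

# Build corelib.zip so that the corelib/ folder sits at the archive root,
# matching the result of `zip -r corelib.zip corelib/`.
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "corelib.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("corelib/__init__.py", "")
    zf.writestr("corelib/services/__init__.py", "")
    zf.writestr("corelib/services/dynamoDB_service.py",
                "TABLE = 'my-table'\n")  # stand-in for the real service code

# Locally this substitutes for Glue's extra files mechanism.
sys.path.insert(0, zip_path)
from corelib.services import dynamoDB_service

print(dynamoDB_service.TABLE)
```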

Lancelle answered 15/4, 2020 at 20:14 Comment(1)
Thanks Emerson for the clues. Yes, I followed the same, except I kept the directory structure as it is, and it worked. See my comment on karan's answer. – Americana
