Extra files are not copied to job run directory

I am running a simple Python shell job that tries to read a config file stored in an S3 bucket folder. The Glue service role has read/write permission on the bucket's objects. I have set the --extra-files special parameter to point to the config file's S3 location.

When I run the job, I still get a FileNotFoundError. I also used os.listdir() to inspect the working directory, and the config file is not there.

Any help is much appreciated. Thanks

import os
import yaml

print(os.listdir("."))

file_path = "config_aws.yaml"
with open(file_path, 'r') as configfile:
    config = yaml.load(configfile, Loader=yaml.FullLoader)

for section in config:
    print(section)
Camembert answered 4/8, 2019 at 3:10 Comment(4)
Could you share how you are invoking the Glue script? For instance, something along the lines of aws glue start-job-run --job-name "CSV to CSV" --arguments='--scriptLocation="s3://<bucket>/<prefix>/test_lib.py",--extra-files="s3://<bucket>/<prefix>/config-aws.yaml"'. - Penknife
I am running it through the AWS Console, specifying the object in the "Referenced files path" parameter. My path looks like this: s3://aws-glue-scripts-123123123123-us-east-1/root/config_aws.yaml - Camembert
I have now printed the contents of each folder that Glue creates in the working directory. Top-level folder: ['bin', 'lib', 'runscript.py', 'include', 'glue-python-libs-bg9qrzh5', 'glue-python-scripts-opdtqked']. Contents of 'glue-python-scripts-opdtqked': ['test43210.py']. Contents of 'glue-python-libs-bg9qrzh5': ['config_aws.yaml']. So the file is there, but based on the documentation I was expecting config_aws.yaml in the same directory as the script file. Am I missing something? Any help would be appreciated. - Camembert
It seems very strange if this is how it is meant to work: because of the random token at the end of the folder name, the script has to discover the libs folder name on every run before it can refer to the files inside it. - Camembert

I'm facing the same issue. I found that the file is under a directory named glue-python-libs-....

So I had to do the following (a horrible solution, by the way):

import os
# Locate the randomly suffixed directory where Glue places the extra files.
config_dir = [f for f in os.listdir("./") if f.startswith("glue-python-libs-")][0]
config_file = f"{config_dir}/config.json"
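
The same lookup can be written a little more defensively with glob. This is only a sketch, assuming the extra files land in a directory whose name starts with glue-python-libs- under the working directory, as observed in this thread. The helper name find_extra_file and its search_root parameter are hypothetical, not part of any Glue API.

```python
import glob
import os


def find_extra_file(filename, search_root="."):
    """Return the path of `filename` inside a glue-python-libs-* directory.

    Hypothetical helper: it only relies on the directory naming pattern
    reported in this thread, which AWS may change.
    """
    # Escape the root so special characters in it are not treated as wildcards.
    pattern = os.path.join(glob.escape(search_root), "glue-python-libs-*", filename)
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise FileNotFoundError(f"{filename} not found using pattern {pattern}")
    # If several directories match, take the first one in sorted order.
    return matches[0]
```

In the job script this would be called as find_extra_file("config.json"), and it raises a clear FileNotFoundError instead of an IndexError when nothing matches.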
Binoculars answered 23/8, 2019 at 15:39 Comment(2)
Yes, this is what I ended up doing, which is why I commented earlier that this approach (if it really is the solution the AWS Glue team recommends) is very strange. - Camembert
The other approach (horrible as well) is to read it from S3 :( - Binoculars

I know this question is over 3 years old and AWS Glue has moved on, but you can currently determine the location of any --extra-files (for Python shell Glue Jobs) by looking at the OS environment variable EXTRA_FILES_DIR e.g.

import os
extra_files_dir = os.environ['EXTRA_FILES_DIR']

In my case, the files had been copied to /tmp/glue-python-libs-IbWD

Hope this helps someone.
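
Because at least one commenter reports a KeyError for this variable, a cautious version might fall back to searching for the glue-python-libs-* directory when EXTRA_FILES_DIR is not set. The following is a minimal sketch under that assumption. The function name resolve_extra_files_dir and its fallback_root parameter are hypothetical, and the directory naming pattern is taken from the observations in this thread rather than from documented behaviour.

```python
import glob
import os


def resolve_extra_files_dir(fallback_root="."):
    """Return the directory holding the --extra-files for this job run.

    Hypothetical helper: prefers the EXTRA_FILES_DIR environment variable
    and otherwise searches `fallback_root` for a glue-python-libs-* folder.
    """
    env_dir = os.environ.get("EXTRA_FILES_DIR")
    if env_dir and os.path.isdir(env_dir):
        return env_dir

    # Fallback for environments where the variable is missing.
    pattern = os.path.join(glob.escape(fallback_root), "glue-python-libs-*")
    candidates = sorted(p for p in glob.glob(pattern) if os.path.isdir(p))
    if candidates:
        return candidates[0]

    raise FileNotFoundError(
        "EXTRA_FILES_DIR is not set and no glue-python-libs-* directory was found"
    )
```

The config file path would then be built with os.path.join(resolve_extra_files_dir(), "config_aws.yaml").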

Unhappy answered 14/12, 2022 at 10:27 Comment(5)
I get a KeyError for EXTRA_FILES_DIR. - Lineup
os.environ['EXTRA_FILES_DIR'] works for my Python shell Glue jobs, but may not work for other Glue job types. What type of Glue job do you have, @Simon30? - Unhappy
It actually failed on a Python shell job. I listed the environment variables and could not find this one. I use Python 3.7. - Lineup
I fixed this by using glob to locate the file I want to load. - Lineup
I'm using Python v3.9 and GlueVersion v3.0 (in a CFN template), and passing some files in the Glue job definition via --extra-files, which is called "Referenced files path" in the Glue console. - Unhappy
