I too received this super helpful error message.
What worked for me was explicitly setting properties like worker type, number of workers, Glue version and Python version.
In Terraform code:
    resource "aws_glue_job" "my_job" {
      name              = "my_job"
      role_arn          = aws_iam_role.glue.arn
      worker_type       = "Standard"
      number_of_workers = 2
      glue_version      = "4.0"

      command {
        script_location = "s3://my-bucket/my-script.py"
        python_version  = "3"
      }

      default_arguments = {
        "--enable-job-insights"       = "true"
        "--additional-python-modules" = "boto3==1.26.52,pandas==1.5.2,SQLAlchemy==1.4.46,requests==2.28.2"
      }
    }
Update
After some more digging, I realised that what I needed was a Python shell Glue job, not an ETL (Spark) job. By choosing this flavour of job, setting the Python version to 3.9 and "ticking the box" for Glue's pre-installed analytics libraries, my script had, as a side effect, access to all the libraries I needed.
My Terraform code ended up looking like this:
    resource "aws_glue_job" "my_job" {
      name         = "my-job"
      role_arn     = aws_iam_role.glue.arn
      glue_version = "1.0"
      max_capacity = 1

      connections = [
        aws_glue_connection.redshift.name
      ]

      command {
        name            = "pythonshell"
        script_location = "s3://my-bucket/my-script.py"
        python_version  = "3.9"
      }

      default_arguments = {
        "--enable-job-insights" = "true"
        "--library-set"         = "analytics"
      }
    }
Note that I switched to Glue version 1.0. I arrived at this through trial and error, and could not find it explicitly documented as the compatible version for pythonshell jobs… but it works!
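If you want to confirm which libraries a given job configuration actually exposes, a quick check at the top of the job script can help. This is only a rough sketch, not part of my original job: check_modules is a name I made up, and the module list is simply the import names of the packages from my first Terraform example, so adjust it to whatever your script needs.

```python
import importlib


def check_modules(names):
    """Map each module name to its version string, or None if it cannot be imported."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The module is not installed in this job's environment.
            report[name] = None
        else:
            # Not every module exposes __version__, so fall back to "unknown".
            report[name] = getattr(module, "__version__", "unknown")
    return report


if __name__ == "__main__":
    # Hypothetical module list, based on the packages pinned in the first example.
    wanted = ["boto3", "pandas", "sqlalchemy", "requests"]
    for name, version in check_modules(wanted).items():
        print(f"{name}: {version if version else 'NOT AVAILABLE'}")
```

The printed lines should show up in the job's output logs, which makes it easier to tell a missing library apart from a problem elsewhere in the job configuration.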