Is there a temporary folder that I can access while using AWS Glue?

Is there a temporary folder that I can access to hold files temporarily while running processes within AWS Glue? For example, in Lambda we have access to a /tmp directory for as long as the process is executing. Do we have something similar in AWS Glue where we can store files while the job is executing?

Symphony answered 12/1, 2018 at 18:29 Comment(0)

Are you asking for this? There are a number of argument names that AWS Glue recognizes and uses, which you can use to set up the script environment for your Jobs and JobRuns:

  • --TempDir — Specifies an S3 path to a bucket that can be used as a temporary directory for the Job.
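
For illustration, here is a minimal sketch of how a job script could read that parameter; getResolvedOptions is the standard awsglue helper, and the value itself depends on your job configuration:

import sys
from awsglue.utils import getResolvedOptions

# Glue passes --TempDir on the command line; getResolvedOptions parses it from sys.argv
args = getResolvedOptions(sys.argv, ['TempDir'])
print(args['TempDir'])  # an s3://... path, set in your job configuration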

See the AWS Glue documentation on job parameters for the full list.

Hope this helps.

Gooseherd answered 17/1, 2018 at 10:15 Comment(1)
Hey, thanks for the response. It's not exactly what I'm looking for. I was hoping to have a temp dir local to the system running the process, as using the S3 path adds the overhead of uploading and downloading the file.Symphony

Yes, there is a tmp directory which you can use to move files to and from S3:

import boto3

s3 = boto3.resource('s3')

# Download a file from S3 into the local 'tmp/' directory
# (bucket_name, DATA_DIR and file are assumed to be defined elsewhere)
s3.Bucket(bucket_name).download_file(DATA_DIR + file, 'tmp/' + file)

And you can also upload files from 'tmp/' back to S3.
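
A minimal sketch of the reverse direction, assuming the same s3 resource and the same bucket_name, DATA_DIR and file variables as above:

# Upload a file from the local 'tmp/' directory back to S3
s3.Bucket(bucket_name).upload_file('tmp/' + file, DATA_DIR + file)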

Stillbirth answered 31/7, 2018 at 2:59 Comment(4)
I think the current working directory for Glue startup is already set to 'tmp', therefore prepending 'tmp/' is unnecessary.Challenge
This directory is usable, but you won't be able to load local files using Spark.Bufflehead
What is the max capacity of this tmp? Lambda's tmp is 500 MB max, and I have a use case of using tmp of up to 2 GB at a time. Any help?Riba
It's an S3 bucket, so no limit @AakashBasuAzazel

The OP clarified in a comment:

I was hoping to have a temp dir local to the system

My experience in Oct 2023...

For AWS Glue 4.0 Spark jobs (not tested with lower Glue versions or with Python Shell jobs), the folder /tmp is usable. Note that this is NOT the temporary location that you specify in the job details tab (that location is in S3).

I have successfully used /tmp to extract a large (9 GB) CSV file from a zip archive before uploading it to S3.
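
As a minimal sketch of that workflow (the archive name, member name, bucket, and key below are all hypothetical):

import zipfile
import boto3

# Extract one member of a zip archive already downloaded to /tmp
with zipfile.ZipFile('/tmp/archive.zip') as zf:
    zf.extract('data.csv', path='/tmp')

# Upload the extracted file to S3
s3 = boto3.client('s3')
s3.upload_file('/tmp/data.csv', 'my-bucket', 'extracted/data.csv')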

How much space is available?
The table in this AWS post lists disk sizes for the different worker types, but it's unclear how much of that is actually available to jobs. As I said, I've gone up to 9 GB.

Incense answered 5/10, 2023 at 17:38 Comment(4)
Were you able to read the file in /tmp with Spark? I got a resource error when I tried to read it in. My use case is to use /tmp for converting unsplittable gzip to bz2.Anitraaniweta
@Anitraaniweta I haven't tried reading from /tmp into a dataframe. My script does the following: 1) download a zip file from a URL into /tmp 2) extract and decompress one of the files from the zip archive, saving it to /tmp 3) upload the extracted file from /tmp to an S3 bucket 4) create a dataframe by reading in from the S3 location. I'm no Spark expert, but my guess about your error would be that the /tmp location is on the driver, whereas when actually creating a dataframe the location needs to be accessible to all the executors.Incense
I think you are right. When I saved the file back to a temp location on S3, all worked fine. So presumably when I read from S3, all workers are reading in parallel, whereas when I read from /tmp, it is only accessible to the workers on the driver. Not sure how to make a tmp accessible to all?Anitraaniweta
/tmp is also usable in Python Shell jobs.Carbonaceous

For anyone looking for an answer to this for Python Shell jobs: yes, this is possible. I've been using the /tmp folder with Python 3.9 jobs, but the disk space is somewhat limited. The doc says:

You can set the value to 0.0625 or 1. The default is 0.0625. In either case, the local disk for the instance will be 20GB.

However, this test suggests that around 5 GB of those 20 GB are already in use.
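
If in doubt, you can measure it from inside a job; a minimal sketch using only the standard library:

import shutil

# Report total/used/free space on the volume backing /tmp
total, used, free = shutil.disk_usage('/tmp')
gb = 1024 ** 3
print(f"total {total / gb:.1f} GB, used {used / gb:.1f} GB, free {free / gb:.1f} GB")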

Mcnully answered 18/4 at 17:30 Comment(0)
