I have an error "java.io.FileNotFoundException: No such file or directory" while trying to create a dynamic frame using a notebook in AWS Glue

I'm setting up a new Jupyter Notebook on an AWS Glue dev endpoint in order to test some code for an ETL script. So far I've created a basic ETL script with AWS Glue, but for some reason, when I try to run the code in the Jupyter Notebook, I keep getting a FileNotFoundException.

I'm using a table (in the Data Catalog) that was created by an AWS Glue Crawler to fetch the information associated with an S3 bucket. I'm able to get the filenames inside the bucket, but when I try to read a file through the dynamic frame, a FileNotFoundException is thrown.

Has anyone ever had this issue before?

This is running on an AWS account in N. Virginia (us-east-1). I've already set up the permissions, granted IAM roles to the AWS Glue service, set up the VPC endpoints, and tried running the job directly in AWS Glue, to no avail.

This is the basic code:

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "xxx-database", table_name = "mytable_item", transformation_ctx = "datasource0")

datasource0.printSchema()
datasource0.show()

Alternatively:

datasource0 = glueContext.create_dynamic_frame.from_options(
    's3',
    connection_options={"paths": ["s3://my-bucket/92387979/My-Table-Item/2016-09-11T16:30:00.000Z/"]},
    format="json",
    transformation_ctx="")

datasource0.printSchema()
datasource0.show()

I would expect to receive the dynamic frame's content, but instead this throws the following error:

An error occurred while calling o343.schema.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 23, ip-172-31-87-88.ec2.internal, executor 6): java.io.FileNotFoundException: No such file or directory 's3://my-bucket/92387979/My-Table-Item/2016-09-11T16:30:00.000Z/92387979-My-Table-Item-2016-09-11T16:30:00.000Z.json.gz'
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:826)
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:1206)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:773)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:166)
    at com.amazonaws.services.glue.hadoop.TapeHadoopRecordReader.initialize(TapeHadoopRecordReader.scala:99)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:182)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:179)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)

Thanks in advance for any help given.

Oleaceous answered 9/7, 2019 at 18:43 Comment(4)
Sounds like an S3 permissions issue from a quick glance; could you post them? – Interinsurance
Looks like some issue with the table and/or the S3 permissions. Could you provide the details of your create table command and the S3 permissions on the bucket in question? – Nipper
Yep, it was an S3 permissions issue. I also did a test with Boto3 to get the object using the S3 client (a sketch of that check follows below), and it gave me access denied. Thanks to both for pointing it out. – Oleaceous
I have all the IAM permissions but am still facing the issue. – Morphine
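
The Boto3 check mentioned in the comments is a quick way to tell a permissions problem apart from a genuinely missing file. A minimal sketch, assuming the notebook's own credentials; the bucket and key are simply the ones from the error message in the question:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
bucket = "my-bucket"  # bucket name from the question's error message
key = ("92387979/My-Table-Item/2016-09-11T16:30:00.000Z/"
       "92387979-My-Table-Item-2016-09-11T16:30:00.000Z.json.gz")

try:
    s3.head_object(Bucket=bucket, Key=key)
    print("Object is readable with the current credentials")
except ClientError as e:
    # A 403 here (rather than a 404) usually points to missing permissions,
    # not a missing file, even though Glue surfaces it as a FileNotFoundException.
    print("Access check failed:", e.response["Error"]["Code"])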

Well, as Chris D'Englere and Harsh Bafna pointed out, it was indeed a permissions issue. As it turns out, I had forgotten to grant the object-level S3 permission (s3:GetObject) on the objects inside the bucket, and not only permissions on the bucket itself.
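
For reference, this is roughly what the missing piece looks like; a minimal sketch only, with a hypothetical role name and inline policy name, and the bucket name from the question. The key point is that s3:GetObject applies to the object ARN (bucket/*), while bucket-level actions such as s3:ListBucket apply to the bucket ARN itself:

import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # bucket-level action: listing uses the bucket ARN
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::my-bucket",
        },
        {   # object-level action: reads need the /* object ARN
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-bucket/*",
        },
    ],
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="MyGlueServiceRole",  # hypothetical: the role attached to the dev endpoint / job
    PolicyName="glue-s3-read",     # hypothetical inline policy name
    PolicyDocument=json.dumps(policy),
)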

Thanks for the help!

Oleaceous answered 10/7, 2019 at 10:15 Comment(1)
What is the fix, permission-wise? – Taxidermy

The issue is with S3 permissions in all likelihood.

Go to IAM and add the s3:GetObject permission to the policy attached to the role that Glue is using. Make sure the policy's resource covers the objects inside the specific S3 bucket, not just the bucket itself.

I had the same issue, tried what I described above, and now it's working.
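
If you want to confirm the change took effect without re-running the job, one option is the IAM policy simulator through Boto3; a sketch only, with placeholder role and object ARNs:

import boto3

iam = boto3.client("iam")
result = iam.simulate_principal_policy(
    PolicySourceArn="arn:aws:iam::123456789012:role/MyGlueServiceRole",  # placeholder role ARN
    ActionNames=["s3:GetObject"],
    ResourceArns=["arn:aws:s3:::my-bucket/some-object.json.gz"],  # placeholder object ARN
)
for evaluation in result["EvaluationResults"]:
    print(evaluation["EvalActionName"], evaluation["EvalDecision"])  # expect "allowed"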

Cyclopean answered 21/11, 2022 at 10:47 Comment(0)
