I have some data stored in an S3 bucket in Parquet format, following a Hive-style partitioning scheme with these partition keys: retailer, year, month, day.
E.g.:
my-bucket/
    retailer=a/
        year=2020/
            ...
    retailer=b/
        year=2020/
            month=2/
                ...
I want to read all of this data in a SageMaker notebook, and I want the partitions to be columns of my DynamicFrame, so that they are included when I call df.printSchema().
If I use Glue's suggested method, the partitions don't get included in my schema. Here's the code I'm using:
df = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={
        'paths': ['s3://my-bucket/'],
        'partitionKeys': ['retailer', 'year', 'month', 'day']
    },
    format='parquet'
)
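(For context: I know that if I crawled the bucket and read from the Glue Data Catalog instead, the partition keys would presumably show up as columns, roughly like the sketch below, where my_database and my_table are hypothetical catalog names. But I'd like to read directly from S3.)

# Rough sketch of a catalog-based read; assumes a crawler has already
# registered the bucket as my_database.my_table with the partition keys.
catalog_df = glueContext.create_dynamic_frame.from_catalog(
    database='my_database',
    table_name='my_table'
)
catalog_df.printSchema()  # partition keys should appear here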
Using plain Spark code and the DataFrame class instead, it works and the partitions are included in my schema:
df = spark.read.parquet('s3://my-bucket/')
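A workaround I could probably use is to do the read with Spark and then convert the result back to a DynamicFrame with DynamicFrame.fromDF (a rough sketch, assuming the GlueContext is already set up as above):

from awsglue.dynamicframe import DynamicFrame

# Read with Spark, which infers the partition columns from the paths,
# then wrap the resulting DataFrame in a DynamicFrame.
spark_df = spark.read.parquet('s3://my-bucket/')
dyf = DynamicFrame.fromDF(spark_df, glueContext, 'dyf')
dyf.printSchema()  # retailer / year / month / day should be included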
Still, I wonder if there is a way to do this directly with AWS Glue-specific methods or not.