I have some data stored in an S3 bucket in Parquet format, following a Hive-style partitioning scheme with these partition keys: retailer, year, month, day.
E.g.:
my-bucket/
    retailer=a/
        year=2020/
            ...
    retailer=b/
        year=2020/
            month=2/
                ...
I want to read all of this data in a SageMaker notebook, and I want the partitions to be columns of my DynamicFrame, so that they are included when I call df.printSchema().
If I use Glue's suggested method, the partitions don't get included in my schema. Here's the code I'm using:
df = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={
        'paths': ['s3://my-bucket/'],
        'partitionKeys': ['retailer', 'year', 'month', 'day']
    },
    format='parquet'
)
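(For context: I know that if I crawled the bucket and read from the Glue Data Catalog instead, the partition keys would presumably show up as columns, roughly like the sketch below, where my_database and my_table are hypothetical catalog names. But I'd like to read directly from S3.)

# Rough sketch of a catalog-based read; assumes a crawler has already
# registered the bucket as my_database.my_table with the partition keys.
catalog_df = glueContext.create_dynamic_frame.from_catalog(
    database='my_database',
    table_name='my_table'
)
catalog_df.printSchema()  # partition keys should appear here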
Using plain Spark code and the DataFrame class instead, it works and the partitions are included in my schema:
df = spark.read.parquet('s3://my-bucket/')
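A workaround I could probably use is to do the read with Spark and then convert the result back to a DynamicFrame with DynamicFrame.fromDF (a rough sketch, assuming the GlueContext is already set up as above):

from awsglue.dynamicframe import DynamicFrame

# Read with Spark, which infers the partition columns from the paths,
# then wrap the resulting DataFrame in a DynamicFrame.
spark_df = spark.read.parquet('s3://my-bucket/')
dyf = DynamicFrame.fromDF(spark_df, glueContext, 'dyf')
dyf.printSchema()  # retailer / year / month / day should be included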
Still, I wonder if there is a way to do this directly with AWS Glue-specific methods or not.