Date fields transformation from AWS Glue table to RedShift Spectrum external table
Asked Answered
L

1

7

I am trying to transform the JSON dataset from S3 to Glue table schema into an Redshift spectrum for data analysis. While creating external tables, how to transform the DATE fields?

Need to highlight the source data is coming from MongoDB in ISODate format. Here, is the Glue table format.

  struct $date:string

Tried the following formats within the External table

startDate:struct<$date:varchar(40)>
startDate:struct<date:varchar(40)>
startDate:struct<date:timestamp>

Is there a work around within the Redshift Spectrum or Glue to handle ISODate formats? Or the recommendation is to go back to the source to convert the ISOdate format?

Locoism answered 19/3, 2019 at 21:59 Comment(0)
C
0

Assuming you are using Python in glue, and assuming python understands your field as a date, you could do something like:

from pyspark.sql.functions import date_format
from awsglue.dynamicframe import DynamicFrame
from awsglue.context import GlueContext


def out_date_format(to_format):
    """formats the passed date into MM/dd/yyyy format"""
    return date_format(to_format,"MM/dd/yyyy")

#if you have a dynamic frame you will need to convert it to a dataframe first:
#dataframe = dynamic_frame.toDF()

dataframe.withColumn("new_column_name", out_date_format("your_old_date_column_name"))

#assuming you are outputting via glue, you will need to convert the dataframe back into a dynamic frame:
#glue_context = GlueContext(spark_context)
#final = DynamicFrame.fromDF(dataframe, glue_context,"final")

Depending on how you are getting the data, there may be other options to use mapping or formatting. If python doesn't understand your field as a date object, you will need to parse it first, something like:

import dateutil.parser

#and the convert would change to:

def out_date_format(to_format):
    """formats the passed date into MM/dd/yyyy format"""
    yourdate = dateutil.parser.parse(to_format)
    return date_format(yourdate,"MM/dd/yyyy")

Note that if the dateutil isn't built into glue, you will need to add it to your job parameters with syntax like: "--additional-python-modules" = "python-dateutil==2.8.1"

Correia answered 5/5, 2021 at 18:48 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.