AWS Glue not copying id(int) column to Redshift - it's blank
I'm having a very weird problem with Glue. I'm using it to run some ETL on data I'm moving from MySQL RDS to Redshift, with the same code I used on another table, where it worked fine and copied all the data as it should have.

However on the second table, for some reason it doesn't copy the data in the id column from MySQL. The id column on Redshift is completely blank.

# Read the source table/query from MySQL RDS over JDBC
query_df = spark.read.format("jdbc") \
    .option("url", args['RDSURL']) \
    .option("driver", args['RDSDRIVER']) \
    .option("dbtable", args['RDSQUERY']) \
    .option("user", args['RDSUSER']) \
    .option("password", args['RDSPASS']) \
    .load()

# Convert the Spark DataFrame to a Glue DynamicFrame
datasource0 = DynamicFrame.fromDF(query_df, glueContext, "datasource0")

logging.info(datasource0.show())

applymapping1 = ApplyMapping.apply(frame = datasource0,
    mappings = [("id", "int", "id", "int"), ...],
    transformation_ctx = "applymapping1")

logging.info(applymapping1.show())

From the logs I print above I can see that the DynamicFrame contains the id field even after ApplyMapping.

datasink2 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = applymapping1,
    catalog_connection = args['RSCLUSTER'],
    connection_options = {"dbtable": args['RSTABLE'], "database": args['RSDB']},
    redshift_tmp_dir = args["TempDir"],
    transformation_ctx = "datasink2")

The problem seems to be happening here, I think. After this the job completes, but on checking Redshift the id column is completely empty.

Very puzzled by this behaviour. The exact same code worked fine on another table; the only difference between the id columns in these two tables is that this table has id as int(11) unsigned, while the table where the code worked had id as int(10) signed.

Arianaariane answered 31/1, 2019 at 19:39 Comment(14)
Can you remove the data type of the source column in apply mapping as [("id", "id", "int")] and retry loading the data again. If it's not working, check the schema and data immediately before writing the data to Redshift.Ashy
I did do the latter, i.e. I printed the contents of applymapping1 to the logs and it did contain the id values. For some reason glueContext.write_dynamic_frame is not writing the id into the Redshift table. In any case, I've decided to move away from Glue for the time being in favor of Lambda and EC2.Arianaariane
How do you add signed/unsigned in Redshift? Redshift doesn't have such a datatype.Ator
@SandeepFatangare That's not the point. MySQL does, and if Glue can't convert data types to match those of Redshift it's pretty useless. In comparison, the pandas-redshift package in Python could move all the data. It could even create a table in Redshift with appropriate data types for columns based on the definition of the table I pulled from MySQL.Arianaariane
Not sure if bigint will work instead of int. Unsigned increases the int limit, which has no equivalent in Redshift as there is no unsigned int datatype there, so the safer bet is to go to the next higher datatype. You may use pandas-redshift with the new Python shell feature added in Glue: docs.aws.amazon.com/glue/latest/dg/aAtor
@SandeepFatangare It didn’t work, that was one of the first things I tried. Also I doubt data type is the issue. I cast id as a signed int in the MySQL query and that didn’t work either.Arianaariane
Ohh, then it is better to raise an AWS support ticket and get more details from them.Ator
Did you ever solve this? I'm seeing the same issue.This
Nope, as I'd said in another comment, I'd moved in favour of Lambda, which worked fine as I wasn't working with too much data. If you have a lot of data I'd recommend having a look at EMR Spark.Arianaariane
Alrighty, thanks. I have needs for large data sets to be moved around. I have tried writing something by hand, but I don't want this to become a maintenance nightmare. :)This
The suggestion from Prabhakar Reddy worked for me: "remove data type of source column in apply mapping as [("id", "id", "int")]"Offshore
Did anyone find a solution to this? I tried the above approach but it's still not working; I'm running into the same error. Any help would be appreciated. Thanks. TaskSetManager: Lost task 2.0 in stage 7.0 (TID 55, 10.5.7.234, executor 1): java.sql.BatchUpdateException: Column 'ID' cannot be null at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) atBeveridge
I was able to resolve this issue by making the ID column not part of the primary key. If the ID column is the primary key then the data doesn't get published from the source table. I guess it makes sense.Beveridge
Try explicitly casting the id column to the target datatype.Trainee

I've had exactly this behaviour when extracting from MySQL RDS using Glue. For anyone seeking the answer to this, the reason is as follows: AWS Glue has the concept of a 'type choice', where the exact type of a crawled column can remain as a number of possibilities throughout the ETL job, since the crawler only samples a subset of a column's data to determine the probable type and doesn't decide definitively. This is why converting to an explicit schema rather than a crawler fixes the issue: it doesn't involve any type choices.

When the job runs (or you look at a preview), Spark will attempt to process the entire column dataset. At this point it's possible that the column type is resolved to a type that is incompatible with the data set, i.e. the interpreter can't decide on the right type choice, and this results in empty data for the column in question. I have experienced this when transforming a number of tables from a MySQL DB, and I haven't been able to determine any apparent pattern for why some fail and some don't, although it must be related to the data in the source DB column.
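
One way to confirm that an unresolved choice is the culprit is to print the DynamicFrame's schema just before the write; a column that is still a type choice shows up as a choice rather than a plain int. A small diagnostic sketch, reusing the frame name from the question (an assumption about that job):

# Diagnostic only: inspect the schema just before the Redshift write.
# 'applymapping1' is the frame name from the question; an unresolved
# column is reported as a choice type instead of int.
applymapping1.printSchema()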

The solution is to add to your script an explicit resolution of the choices, casting the failing column to the desired target type with something like the following:

df = df.resolveChoice(specs = [('id', 'cast:int')])

Where df is the data frame. This will force the column to be interpreted as the intended type and should result in the expected output of data in this column. This has always worked for me.
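
For a job written as a script like the one in the question (rather than Glue Studio), a minimal sketch of where the resolveChoice call fits, between ApplyMapping and the Redshift write, might look like the following. The frame and argument names (applymapping1, RSCLUSTER, RSTABLE, RSDB, TempDir) are taken from the question and are assumptions about that job:

# Resolve the ambiguous 'id' type to a concrete int before writing to Redshift
resolved1 = applymapping1.resolveChoice(specs = [('id', 'cast:int')])

datasink2 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = resolved1,
    catalog_connection = args['RSCLUSTER'],
    connection_options = {"dbtable": args['RSTABLE'], "database": args['RSDB']},
    redshift_tmp_dir = args["TempDir"],
    transformation_ctx = "datasink2")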

Note that for those using the Glue Studio visual editor it's now possible to add a 'Custom Transformation' step which contains code to perform this for you. In this instance the code for the transformation should look as follows:

def MyTransform (glueContext, dfc) -> DynamicFrameCollection:
    df = dfc.select(list(dfc.keys())[0])
    df_resolved = df.resolveChoice(specs = [('id', 'cast:int')])
    return DynamicFrameCollection({"CustomTransform0": df_resolved}, glueContext)

Note also that in this scenario it will be necessary to follow this Custom Transformation node with a 'Select from Collection' transformation, since the Custom Transformation returns a collection rather than a single frame.
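
In the script that Glue Studio generates, that 'Select from Collection' node corresponds roughly to the SelectFromCollection transform. This is a sketch only; the variable names here (transformed_collection, selected) are illustrative, not the exact code Glue Studio emits:

from awsglue.transforms import SelectFromCollection

# 'transformed_collection' is the DynamicFrameCollection returned by MyTransform above;
# SelectFromCollection picks the single frame back out of it by key.
selected = SelectFromCollection.apply(
    dfc = transformed_collection,
    key = list(transformed_collection.keys())[0],
    transformation_ctx = "selected")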

Kimikokimitri answered 16/8, 2021 at 13:32 Comment(1)
This fixed my issue. One thing I noticed in my MySQL database is that columns of type int would work, but columns of type int unsigned would always return null in the ApplyMapping node of Glue Studio.Displace
