AWS Glue not copying id(int) column to Redshift - it's blank
I'm having a very weird problem with Glue. I'm using it to run some ETL on data I'm moving from MySQL RDS to Redshift, with the same code I used on another table, where it worked fine and copied all the data as it should have.

However on the second table, for some reason it doesn't copy the data in the id column from MySQL. The id column on Redshift is completely blank.

# Read the source table/query from MySQL RDS over JDBC
query_df = spark.read.format("jdbc") \
    .option("url", args['RDSURL']) \
    .option("driver", args['RDSDRIVER']) \
    .option("dbtable", args['RDSQUERY']) \
    .option("user", args['RDSUSER']) \
    .option("password", args['RDSPASS']) \
    .load()

# Convert the Spark DataFrame to a Glue DynamicFrame
datasource0 = DynamicFrame.fromDF(query_df, glueContext, "datasource0")

logging.info(datasource0.show())

applymapping1 = ApplyMapping.apply(frame = datasource0,
    mappings = [("id", "int", "id", "int"), ...],
    transformation_ctx = "applymapping1")

logging.info(applymapping1.show())

From the logs I print above I can see that the DynamicFrame contains the id field even after ApplyMapping.

datasink2 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = applymapping1,
    catalog_connection = args['RSCLUSTER'],
    connection_options = {"dbtable": args['RSTABLE'], "database": args['RSDB']},
    redshift_tmp_dir = args["TempDir"],
    transformation_ctx = "datasink2")

The problem seems to be happening here, I think. After this the job completes, but on checking Redshift the id column is completely empty.

Very puzzled by this behaviour. The exact same code worked fine on another table; the only difference between the id columns in these two tables is that this table has id as int(11) unsigned, while the table where the code worked had id as int(10) signed.

Arianaariane answered 31/1, 2019 at 19:39 Comment(14)
Can you remove the data type of the source column in apply mapping as [("id", "id", "int")] and retry loading the data again. If it's not working, check the schema and data immediately before writing the data to Redshift.Ashy
I did do the latter, i.e. I printed the contents of applymapping1 to the logs and it did contain the id values. For some reason glueContext.write_dynamic_frame is not writing the id into the Redshift table. In any case, I've decided to move away from Glue for the time being in favor of Lambda and EC2.Arianaariane
How do you add signed/unsigned in Redshift? Redshift doesn't have such a datatype.Ator
@SandeepFatangare That's not the point. MySQL does, and if Glue can't convert data types to match those of Redshift it's pretty useless. In comparison, the pandas-redshift package in Python could move all the data. It could even create a table in Redshift with appropriate data types for columns based on the definition of the table I pulled from MySQL.Arianaariane
Not sure if bigint will work instead of int. Unsigned increases the int limit, which has no equivalent in Redshift as there is no unsigned int datatype there, so the safer bet is to go to the next higher datatype. You may use pandas-redshift with the new Python shell feature added in Glue: docs.aws.amazon.com/glue/latest/dg/aAtor
@SandeepFatangare It didn’t work, that was one of the first things I tried. Also I doubt data type is the issue. I cast id as a signed int in the MySQL query and that didn’t work either.Arianaariane
Ohh, then it is better to raise an AWS support ticket and get more details from them.Ator
Did you ever solve this? I'm seeing the same issue.This
Nope, as I'd said in another comment, I'd moved in favour of Lambda, which worked fine as I wasn't working with too much data. If you have a lot of data I'd recommend having a look at EMR Spark.Arianaariane
Alrighty, thanks. I have needs for large data sets to be moved around. I have tried writing something by hand, but I don't want this to become a maintenance nightmare. :)This
The suggestion from Prabhakar Reddy worked for me: "remove data type of source column in apply mapping as [("id", "id", "int")]"Offshore
Did anyone find a solution to this? I tried the above approach but it's still not working; I'm running into the same error. Any help would be appreciated. Thanks. TaskSetManager: Lost task 2.0 in stage 7.0 (TID 55, 10.5.7.234, executor 1): java.sql.BatchUpdateException: Column 'ID' cannot be null at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) atBeveridge
I was able to resolve this issue by making the ID column not part of the primary key. If the ID column is the primary key then the data doesn't get published from the source table. I guess it makes sense.Beveridge
Try explicitly casting the id column to the target datatype.Trainee

I've had exactly this behaviour when extracting from MySQL RDS using Glue. For anyone seeking the answer to this, the reason is as follows: AWS Glue has the concept of a 'type choice', where the exact type of a crawled column can remain as a number of possibilities throughout the ETL job, since the crawler only samples a subset of a column's data to determine the probable type and doesn't decide definitively. This is why converting to an explicit schema rather than a crawler fixes the issue: it doesn't involve any type choices.

When the job runs (or you look at a preview), Spark will attempt to process the entire column dataset. At this point it's possible that the column type is resolved to a type that is incompatible with the data set, i.e. the interpreter can't decide on the right type choice, and this results in empty data for the column in question. I have experienced this when transforming a number of tables from a MySQL DB, and I haven't been able to determine any apparent pattern for why some fail and some don't, although it must be related to the data in the source DB column.
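
One way to confirm that an unresolved choice is the culprit is to print the DynamicFrame's schema just before the write; a column that is still a type choice shows up as a choice rather than a plain int. A small diagnostic sketch, reusing the frame name from the question (an assumption about that job):

# Diagnostic only: inspect the schema just before the Redshift write.
# 'applymapping1' is the frame name from the question; an unresolved
# column is reported as a choice type instead of int.
applymapping1.printSchema()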

The solution is to add to your script an explicit resolution of the choices, casting the failing column to the desired target type with something like the following:

df = df.resolveChoice(specs = [('id', 'cast:int')])

Where df is the data frame. This will force the column to be interpreted as the intended type and should result in the expected output of data in this column. This has always worked for me.
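
For a job written as a script like the one in the question (rather than Glue Studio), a minimal sketch of where the resolveChoice call fits, between ApplyMapping and the Redshift write, might look like the following. The frame and argument names (applymapping1, RSCLUSTER, RSTABLE, RSDB, TempDir) are taken from the question and are assumptions about that job:

# Resolve the ambiguous 'id' type to a concrete int before writing to Redshift
resolved1 = applymapping1.resolveChoice(specs = [('id', 'cast:int')])

datasink2 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = resolved1,
    catalog_connection = args['RSCLUSTER'],
    connection_options = {"dbtable": args['RSTABLE'], "database": args['RSDB']},
    redshift_tmp_dir = args["TempDir"],
    transformation_ctx = "datasink2")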

Note that for those using the Glue Studio visual editor it's now possible to add a 'Custom Transformation' step which contains code to perform this for you. In this instance the code for the transformation should look as follows:

def MyTransform (glueContext, dfc) -> DynamicFrameCollection:
    df = dfc.select(list(dfc.keys())[0])
    df_resolved = df.resolveChoice(specs = [('id', 'cast:int')])
    return DynamicFrameCollection({"CustomTransform0": df_resolved}, glueContext)

Note also that in this scenario it will be necessary to follow this Custom Transformation node with a 'Select from Collection' transformation, since the Custom Transformation returns a collection rather than a single frame.
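
In the script that Glue Studio generates, that 'Select from Collection' node corresponds roughly to the SelectFromCollection transform. This is a sketch only; the variable names here (transformed_collection, selected) are illustrative, not the exact code Glue Studio emits:

from awsglue.transforms import SelectFromCollection

# 'transformed_collection' is the DynamicFrameCollection returned by MyTransform above;
# SelectFromCollection picks the single frame back out of it by key.
selected = SelectFromCollection.apply(
    dfc = transformed_collection,
    key = list(transformed_collection.keys())[0],
    transformation_ctx = "selected")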

Kimikokimitri answered 16/8, 2021 at 13:32 Comment(1)
This fixed my issue. One thing I noticed in my MySQL database is that columns of type int would work, but columns of type int unsigned would always return null in the ApplyMapping node of Glue Studio.Displace
