AWS Glue PySpark replace NULLs

I am running an AWS Glue job to load a pipe-delimited file from S3 into an RDS Postgres instance, using the auto-generated PySpark script from Glue.

Initially, it complained about NULL values in some columns:

pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null"

After some googling and reading on SO, I tried to replace the NULLs in my file by converting my AWS Glue DynamicFrame to a Spark DataFrame, calling fillna(), and converting back to a DynamicFrame.

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="xyz_catalog", table_name="xyz_staging_files",
    transformation_ctx="datasource0")
custom_df = datasource0.toDF()
custom_df2 = custom_df.fillna(-1)
custom_df3 = custom_df2.fromDF()

applymapping1 = ApplyMapping.apply(frame=custom_df3, mappings=[
    ("id", "string", "id", "int"), ........more code

References:

https://github.com/awslabs/aws-glue-samples/blob/master/FAQ_and_How_to.md#3-there-are-some-transforms-that-i-cannot-figure-out

How to replace all Null values of a dataframe in Pyspark

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.fillna

Now, when I run my job, it throws the following error:

Log Contents:
Traceback (most recent call last):
File "script_2017-12-20-22-02-13.py", line 23, in <module>
custom_df3 = custom_df2.fromDF()
AttributeError: 'DataFrame' object has no attribute 'fromDF'
End of LogType:stdout

I am new to Python and Spark and have tried a lot, but can't make sense of this. I'd appreciate some expert help on this.

I tried changing my reconversion line to this:

custom_df3 = glueContext.create_dynamic_frame.fromDF(frame = custom_df2)

But still got the error:

AttributeError: 'DynamicFrameReader' object has no attribute 'fromDF'

UPDATE: I suspect this is not about NULL values. The message "Can't get JDBC type for null" seems not to refer to a NULL value, but to some data or type that JDBC is unable to decipher.

I created a file with only 1 record, no NULL values, changed all Boolean types to INT (and replaced values with 0 and 1), but still get the same error:

pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null"
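
A quick way to verify this (a diagnostic sketch, not from the original script; it assumes custom_df is the Spark DataFrame created above) is to list the columns Spark inferred as NullType, since those are the ones the JDBC writer cannot map to a SQL type:

from pyspark.sql.types import NullType

# Columns inferred as NullType (typically columns that are null in every
# sampled record) are what triggers "Can't get JDBC type for null".
null_cols = [f.name for f in custom_df.schema.fields
             if isinstance(f.dataType, NullType)]
print(null_cols)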

UPDATE: Make sure DynamicFrame is imported (from awsglue.dynamicframe import DynamicFrame), since fromDF and toDF belong to the DynamicFrame class.

Refer to https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html
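
With that import in place, the round trip from the snippet above would look like this (the last argument to fromDF is just a name for the new frame):

from awsglue.dynamicframe import DynamicFrame

custom_df = datasource0.toDF()        # DynamicFrame -> Spark DataFrame
custom_df2 = custom_df.fillna(-1)     # replace nulls with -1
custom_df3 = DynamicFrame.fromDF(custom_df2, glueContext, "custom_df3")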

Koontz answered 20/12, 2017 at 23:25. Comments (2):
Could it be because of the column data types? I remember fillna() and dropna() affecting only columns with a string data type by default, so if the column type is date, for example, it can still include null values after fillna() and dropna(). (Iterative)
Have you tried using DropNullFields? It removes null fields from a DynamicFrame, so the output DynamicFrame does not contain fields of the null type in its schema. That way you can keep using DynamicFrames, which are fine-tuned for Glue jobs, and avoid converting to and from DataFrames. (Carburetor)

You are calling .fromDF on the wrong class. It should look like this:

from awsglue.dynamicframe import DynamicFrame
DynamicFrame.fromDF(custom_df2, glueContext, 'label')
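
The three arguments are the source DataFrame, the GlueContext, and a name for the resulting DynamicFrame.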
Pome answered 26/4, 2018 at 14:48

For this error, pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null", you should drop the null fields.

I was getting similar errors while loading to Redshift tables. After adding the transform below, the issue was resolved:

from awsglue.transforms import DropNullFields
loading= DropNullFields.apply(frame = resolvechoice3, transformation_ctx = "loading")
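
For context, the resulting frame is what then goes to the JDBC writer. A minimal sketch of that step (the connection, table, and database names below are placeholders, not from the original answer):

datasink = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=loading,
    catalog_connection="my_redshift_connection",  # placeholder connection name
    connection_options={"dbtable": "my_table", "database": "my_db"},
    redshift_tmp_dir=args["TempDir"],  # assumes args from getResolvedOptions
    transformation_ctx="datasink")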
Gurule answered 30/7, 2018 at 9:38

In pandas, DataFrame.fillna() fills null values with a specified value. DropNullFields, by contrast, drops all fields of NullType from a DynamicFrame; these are fields with missing or null values in every record of the data set.

In your specific situation, you need to make sure you are using the right class for the appropriate dataset.

Here is the edited version of your code:

from awsglue.dynamicframe import DynamicFrame

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="xyz_catalog", table_name="xyz_staging_files",
    transformation_ctx="datasource0")
custom_df = datasource0.toDF()
custom_df2 = custom_df.fillna(-1)
custom_df3 = DynamicFrame.fromDF(custom_df2, glueContext, 'your_label')

applymapping1 = ApplyMapping.apply(frame=custom_df3, mappings=[
    ("id", "string", "id", "int"), ........more code

This is what the code does:

1. Read the file into a DynamicFrame.
2. Convert it to a DataFrame.
3. Replace the null values.
4. Convert back to a DynamicFrame.
5. Apply the mapping.

You were getting the error because step 4 was wrong: you were feeding a DataFrame to ApplyMapping, which does not work. ApplyMapping is designed for DynamicFrames.

I would suggest reading your data into a DynamicFrame and sticking with that type throughout. It would look like this (one way to do it):

from awsglue.transforms import ApplyMapping, DropNullFields

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="xyz_catalog", table_name="xyz_staging_files",
    transformation_ctx="datasource0")
custom_df = DropNullFields.apply(frame=datasource0)
applymapping1 = ApplyMapping.apply(frame=custom_df, mappings=[
    ("id", "string", "id", "int"), ........more code
Giblet answered 16/3, 2022 at 4:41
