What options can be passed to AWS Glue DynamicFrame.toDF()?

The documentation for the toDF() method says that we can pass an options parameter to it, but it does not specify what those options can be (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html). Does anyone know of further documentation on this? I am specifically interested in passing in a schema when creating a DataFrame from a DynamicFrame.

Malpighi answered 5/10, 2020 at 19:54

Unfortunately there is not much documentation available, but experimentation and analysis of the source code for DynamicFrame suggest the following:

  • The options available in toDF have more to do with the ResolveOption class than with toDF itself, since the ResolveOption class is what gives the parameters their meaning (please read the code).
  • The ResolveOption class takes a ChoiceType as a parameter.
  • The option examples in the documentation are similar to the specs for ResolveChoice, which also mention ChoiceType.
  • The options are then converted to a sequence and passed to the toDF function of _jdf.

My understanding, after looking at the specs, the toDF implementation of DynamicFrame and toDF from Spark, is that we can't pass a schema when creating a DataFrame from a DynamicFrame; only minor column manipulations are possible.

That said, a possible approach is to obtain a DataFrame from the DynamicFrame and then manipulate it to change its schema.
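As a minimal sketch of that approach, assuming a plain per-column cast is all the schema change you need: the helper name cast_columns and the column names in the usage note are hypothetical, and the only Spark calls used are DataFrame.columns, DataFrame.withColumn and Column.cast (which accepts a type name as a string).

```python
def cast_columns(df, target_types):
    """Return a new DataFrame with the listed columns cast to new types.

    df           -- a Spark DataFrame, e.g. the result of DynamicFrame.toDF()
    target_types -- dict mapping column name to a Spark type name,
                    e.g. {"price": "double"}
    """
    # Fail early if a column to cast is not present in the DataFrame.
    missing = [name for name in target_types if name not in df.columns]
    if missing:
        raise ValueError(f"columns not in DataFrame: {missing}")

    # withColumn with an existing name replaces that column in place.
    for name, type_name in target_types.items():
        df = df.withColumn(name, df[name].cast(type_name))
    return df
```

Usage would look something like df = cast_columns(YourDynamicFrame.toDF(), {"price": "double", "quantity": "int"}). Values that cannot be cast become null, so it is worth checking the result before writing it out.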

Schiller answered 8/10, 2020 at 9:50

The documentation is very unclear. It states:

options – A list of options. Specify the target type if you choose the Project and Cast action type. Examples include the following.

toDF([ResolveOption("a.b.c", "KeepAsStruct")])
toDF([ResolveOption("a.b.c", "Project", DoubleType())])

However, 'Cast' is not an allowed action from what I can tell, and ResolveOption is just the name of a tuple that you are expected to define yourself so that it matches the attribute structure Glue looks for.

So here is an example of what to pass to DynamicFrame.toDF() in Python:

from awsglue.dynamicframe import DynamicFrame
from awsglue.gluetypes import *
from collections import namedtuple
# any other imports you need...

# Define a named tuple called ResolveOption with attributes 'path', 'action', and 'target'
ResolveOption = namedtuple('ResolveOption', ['path', 'action', 'target'])

# Create a list of ResolveOption tuples.
# (Useful when converting to a DataFrame and you need to project the data types
# for your schema, so you don't end up with unresolved JSON values
# like {int:111, double:null} etc.)
# action must be either KeepAsStruct or Project
# target should be a type such as StringType(), DoubleType(), etc.
ResolveOptions = [
    ResolveOption(path="columnname", action="Project", target=StringType()),
    # ... (further entries elided in the original)
]

# Assuming you created a dynamic frame named YourDynamicFrame earlier
YourDataFrame = YourDynamicFrame.toDF(ResolveOptions)

Tested and works. Hope this helps
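If you have many columns, a small helper can build and sanity-check the list before it reaches toDF(). This is only a sketch under the assumptions stated above (that the only accepted actions are KeepAsStruct and Project, and that Project needs a target type); the names build_resolve_options and ALLOWED_ACTIONS are hypothetical and not part of the Glue API, and the default of None for target is my own addition so that KeepAsStruct entries can omit it.

```python
from collections import namedtuple

# Same shape as the tuple defined above. The default for 'target' is an
# assumption added here so KeepAsStruct entries need only two values.
ResolveOption = namedtuple(
    "ResolveOption", ["path", "action", "target"], defaults=(None,)
)

# Actions that appear to be accepted, per the observations above.
ALLOWED_ACTIONS = {"KeepAsStruct", "Project"}


def build_resolve_options(specs):
    """Build a list of ResolveOption tuples from (path, action[, target]) specs."""
    options = []
    for spec in specs:
        option = ResolveOption(*spec)
        if option.action not in ALLOWED_ACTIONS:
            raise ValueError(f"unsupported action for {option.path}: {option.action}")
        if option.action == "Project" and option.target is None:
            raise ValueError(f"Project needs a target type for {option.path}")
        options.append(option)
    return options
```

In a Glue job you would then call something like YourDynamicFrame.toDF(build_resolve_options([("columnname", "Project", StringType())])), with StringType coming from awsglue.gluetypes as in the example above.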

Paphos answered 8/3 at 12:3