Using S3 folder structure as metadata in AWS Glue

I am currently running an AWS Glue job that converts CSVs to Parquet files. The source and target of the data are S3 buckets and this all works fine. However, I would like to include information from the S3 path within the Parquet file.

I have looked at the transforms in AWS Glue Studio's visual interface but can't find anything. I've also searched through the awsglue and pyspark Python libraries but can't find anything related to capturing the path/directory structure with glob or regex.

Any help appreciated.

Nerine answered 4/4, 2022 at 14:54

So it turns out AWS Glue/PySpark does have this capability, but it requires a little data wrangling and use of the scripting feature in AWS Glue jobs.

You can use the input_file_name function to get the full file path. This can be added as a column like so:

ApplyMapping_node2 = ApplyMapping_node1.toDF().withColumn("path", input_file_name())
ApplyMapping_node3 = ApplyMapping_node2.fromDF(ApplyMapping_node2, glueContext, "ApplyMapping_node2")

However, if you need to split the path to get a specific file name, you can do something like this:

ApplyMapping_node2 = ApplyMapping_node1.toDF().withColumn("path", input_file_name())
ApplyMapping_node3 = ApplyMapping_node2.withColumn("split_path", split_path_UDF(ApplyMapping_node3['path']))
ApplyMapping_node4 = ApplyMapping_node1.fromDF(ApplyMapping_node3, glueContext, "ApplyMapping_node4")

Here the split_path function is set up as a UDF, like so:

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import input_file_name, udf
from pyspark.sql.types import StringType

# Return only the file name (the last component of the path)
def split_path(path):
    return path.split('/')[-1]

split_path_UDF = udf(split_path, StringType())
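
If all you need is the file name, you can also skip the UDF and stay with built-in Spark functions (split plus element_at, which accepts a negative index to take the last array element, available in the Spark versions recent Glue releases ship). This is just a minimal sketch of that alternative, reusing ApplyMapping_node1 and glueContext from above; "file_name" is an illustrative column name:

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import input_file_name, split, element_at

# Add the full path, then take the last path component with built-in functions
df = ApplyMapping_node1.toDF() \
    .withColumn("path", input_file_name()) \
    .withColumn("file_name", element_at(split("path", "/"), -1))

ApplyMapping_node2 = DynamicFrame.fromDF(df, glueContext, "ApplyMapping_node2")

Avoiding the UDF keeps the whole expression in native Spark, which saves the per-row Python round trip.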
Nerine answered 7/4, 2022 at 13:28
