AWS Glue: Removing quote character from a CSV file while writing
I have a CSV file in S3 which does not contain any quote characters, e.g.:

dVsdfsCcn7j6,r:werwerwerwerwerwerwerwer,_User$SSSSSBFwJ,login,password,false,2011-10-27 10:46:55,d24c2465e-9945645c5-4645509-a745741c7-ba9bcd1a7cfd,2046-11-27 15:46:55,2016-10-27 10:46:55

When I give this file as input to the Glue script/job (in which I intend to remove the _User prefix), the ETL output is a CSV file with quotes added around some attributes, e.g.:

dVsdfsCcn7j6,"r:werwerwerwerwerwerwerwer","_User$SSSSSBFwJ",login,password,false,"2011-10-27 10:46:55","d24c2465e-9945645c5-4645509-a745741c7-ba9bcd1a7cfd","2046-11-27 15:46:55","2016-10-27 10:46:55"

I have set the quoteChar to -1 as per the documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html

datasink2 = glueContext.write_dynamic_frame.from_options(
    frame = applymapping1,
    connection_type = "s3",
    connection_options = {"path": "s3://test-bucket/staging/final_data"},
    format = "csv",
    format_options = {"quoteChar": '-1'},
    transformation_ctx = "datasink2",
)

My end goal is to make this csv data queryable using Athena.

Has anyone faced such an issue during transformation? Is it something related to the SerDe information of the data catalog? Any help will be appreciated!

Thanks.

Link to question posted on AWS forums - https://forums.aws.amazon.com/thread.jspa?threadID=278073&tstart=0

Perspicacious answered 27/3, 2018 at 4:08 Comment(2)
Worked for me using format_options = { "quoteChar": -1 } with no ' around the -1. – Nympho
I had tried without the quotes as well at that time, but it didn't work. AWS might have fixed/implemented it now. – Perspicacious
I've been having the same problem. The from_options quoteChar setting seems to be ignored. I worked around this by using a DataFrame directly to write the output to s3.

applymapping1.toDF() \
    .write \
    .option("quote", " ") \
    .csv("s3://bucket/path/to/output/folder")

This seems a little hacky but works for now until there's an answer from AWS.
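For what it's worth, this quote-on-demand behavior is standard for CSV writers: by default, only fields containing the delimiter, the quote character, or a newline get quoted. A minimal sketch using Python's standard csv module shows the difference between minimal quoting and disabling quoting entirely (the blank-quote workaround above exploits the same mechanism):

```python
import csv
import io

row = ["dVsdfsCcn7j6", "2011-10-27 10:46:55", "a,b"]

# Default QUOTE_MINIMAL: only fields that contain the delimiter,
# the quote character, or a newline are quoted.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow(row)
print(buf.getvalue().strip())
# dVsdfsCcn7j6,2011-10-27 10:46:55,"a,b"

# QUOTE_NONE: never quote; embedded delimiters must be escaped instead.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_NONE, escapechar="\\").writerow(row)
print(buf.getvalue().strip())
# dVsdfsCcn7j6,2011-10-27 10:46:55,a\,b
```

Note that with quoting fully disabled, a field that happens to contain the delimiter would be ambiguous unless it is escaped, which is why Athena's SerDe configuration matters if your data can contain commas.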

Concertgoer answered 3/4, 2018 at 19:37 Comment(3)
Thanks David. That worked. I had read a similar answer where someone used '\u000', but I wanted to find out the official 'Glue' way of doing it. – Perspicacious
Ha, who knew. I was going insane. Luckily I stumbled on this. – Lambda
@PriyankKapasi we faced the same situation. Do you know how you printed \u000? We're trying to discover the reason even after solving the problem, just curious about it. – Mahala
Add a Transforms > Custom Transform node (NameOfCustomNode), where you can write custom code like the snippet above. Add a parent node, NameOfParentNode. Now look in the Script tab, and you can see your custom transform being called like this:

# Script generated for node NameOfCustomNode
NameOfCustomNode_node1234 = MyTransform(
    glueContext,
    DynamicFrameCollection({"NameOfParentNode": NameOfParentNode}, glueContext),
)

And you can then access the parent node (frame) in the dfc and write it to S3:

def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    dfc["NameOfParentNode"].toDF() \
        .write \
        .option("quote", " ") \
        .option("header", "true") \
        .csv("s3://bucket/output/")
Paratuberculosis answered 13/10, 2023 at 9:40 Comment(0)
