Amazon Redshift - COPY from CSV - single Double Quote in row - Invalid quote formatting for CSV Error

I'm loading a CSV file from S3 into Redshift. This CSV file is analytics data which contains the PageUrl (which may contain user search info inside a query string for example).

It chokes on rows that contain a lone double-quote character. For example, if there is a page for a 14" toy, then the PageUrl would contain:

http://www.mywebsite.com/a-14"-toy/1234.html

Redshift understandably can't handle this as it is expecting a closing double quote character.

The way I see it my options are:

  1. Pre-process the input and remove these characters
  2. Configure the COPY command in Redshift to ignore these characters but still load the row
  3. Set MAXERRORS to a high value and sweep up the errors using a separate process

Option 2 would be ideal, but I can't find it!

Any other suggestions if I'm just not looking hard enough?

Thanks

Duncan

Turnheim answered 6/8, 2015 at 16:52 Comment(1)
I know this is a year old, but any updates where you can find an answer for #2?Worked

Unfortunately, there is no way to fix this. You will need to pre-process the file before loading it into Amazon Redshift.

The closest options you have are CSV [ QUOTE [AS] 'quote_character' ], which wraps fields in an alternative quote character, and ESCAPE, which treats a quote character preceded by a backslash as literal data. Alas, both require the file to already be in that format before loading.
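
As a rough sketch of the first option (the table name, S3 path, and IAM role below are placeholders, not from the original question), assuming the file has already been rewritten to wrap fields in | instead of ":

    -- CSV with an alternative quote character; only works if the file
    -- already uses | as its quote character, so a stray " is plain data.
    COPY analytics_pages
    FROM 's3://my-bucket/pageviews.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    CSV QUOTE AS '|';

The ESCAPE route is illustrated under the answer below.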

Stacte answered 7/8, 2015 at 10:21 Comment(3)
Thanks John. Yeah, I've resigned myself to writing a little Pig job to prefix every " with \, which should do the trick.Turnheim
Depending upon the size of your data, you could just run things through sed. You could even do it as a streaming job through aws s3 cp s3://b/f - | sed xxx | aws s3 cp - s3://b/f2. See streaming on docs.aws.amazon.com/cli/latest/reference/s3/cp.htmlStacte
@Uranalysis 's answer below provides a solution without pre-processingEpigastrium

It's 2017 and I ran into the same problem; happy to report there is now a way to get Redshift to load CSV files with the odd " in the data.

The trick is to use the ESCAPE keyword and NOT to use the CSV keyword. I don't know why, but having the CSV and ESCAPE keywords together in a COPY command fails with the error message "CSV is not compatible with ESCAPE;". However, with no change to the data being loaded, the load succeeded once I removed the CSV keyword from the COPY command.
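
For what it's worth, a minimal sketch of the kind of COPY statement this describes; the table name, S3 path, and IAM role are placeholders, not from the original answer:

    -- ESCAPE without the CSV keyword: a lone " in a field is loaded as ordinary data,
    -- and a backslash in the input escapes whatever character follows it
    -- (delimiter, quote, or embedded newline).
    COPY analytics_pages
    FROM 's3://my-bucket/pageviews.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    DELIMITER ','
    ESCAPE;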

You can also refer to this documentation for help: http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html#copy-escape

Uranalysis answered 26/9, 2017 at 7:14 Comment(2)
Thanks a lot for your response, this saved my day 2 years later!Susannesusceptibility
Only FORMAT AS CSV can handle embedded newline characters in the file - how can we handle those without it?Maker

I have done this using DELIMITER ',' IGNOREHEADER 1 as the replacement for CSV at the end of the COPY command. It's working really fine.
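
As a minimal sketch of that (placeholder table, path, and role; note this only holds up if no field contains an embedded comma):

    -- Plain delimited load instead of CSV: no quote parsing, so a stray " is just data.
    -- IGNOREHEADER 1 skips the header row.
    COPY analytics_pages
    FROM 's3://my-bucket/pageviews.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    DELIMITER ','
    IGNOREHEADER 1;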

Informal answered 14/9, 2022 at 7:43 Comment(0)
