Spark - CSV - Write Options - Quotes
Hope everyone is doing well.

While going through the Spark CSV data source options for the linked question, I got quite confused about the differences between the various quote-related options available.

(screenshot: spark-csv quote options)

  1. Is there a detailed explanation of the differences between them?
  2. Does any option override the others, or do they all work together?

I used the example mentioned in the linked question to understand the differences, but I am still a little confused. Thank you for all the help.

Bortz asked 14/11, 2022 at 8:07

Let's run through a few examples.

val df = Seq(
  (1, "aaa"),
  (2, "b,c"),
  (3, "d$e"),
  (4, "f%g")
).toDF("x", "y")
df.show

+---+---+
|  x|  y|
+---+---+
|  1|aaa|
|  2|b,c|
|  3|d$e|
|  4|f%g|
+---+---+

Whenever a value contains commas, they would be indistinguishable from field delimiters (as in 2,b,c), so Spark can quote such values. The quote character is double quotes by default (as in 2,"b,c"), but you can customize it using the quote option.
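For reference, this is roughly what the default behaviour looks like (a quick sketch; the output directory x is just an example path):

df.coalesce(1).write.mode("overwrite")
  .csv("x")

1,aaa
2,"b,c"
3,d$e
4,f%g

Switching to a custom quote character: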

df.coalesce(1).write.mode("overwrite")
  .option("quote", "$")
  .csv("x")

1,aaa
2,$b,c$
3,$d\$e$
4,f%g

We used the dollar sign here. Not only was b,c quoted; since d$e contained a dollar, it was quoted as well, and the dollar itself was escaped.

I don't know why you would want to do that, but you can ask Spark not to escape fields containing the quote character, using the escapeQuotes option.

df.coalesce(1).write.mode("overwrite")
  .option("quote", "$")
  .option("escapeQuotes", false)
  .csv("x")

1,aaa
2,$b,c$
3,d$e
4,f%g

Here, d$e remained unquoted.

You can also force it to quote every field using quoteAll.

df.coalesce(1).write.mode("overwrite")
  .option("quote", "$")
  .option("quoteAll", true)
  .csv("x1")

$1$,$aaa$
$2$,$b,c$
$3$,$d\$e$
$4$,$f%g$

Note that quoteAll = true makes escapeQuotes irrelevant.
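You can check that by setting both options (a sketch reusing the same df; the path x1 is just an example):

df.coalesce(1).write.mode("overwrite")
  .option("quote", "$")
  .option("quoteAll", true)
  .option("escapeQuotes", false)
  .csv("x1")

Per the note above, the output should match the quoteAll example: every field is quoted and the dollar in d$e is still escaped.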

You can customize the character used to escape the quote character inside quoted values using the escape option, if you don't like the default backslash.

df.coalesce(1).write.mode("overwrite")
  .option("quote", "$")
  .option("escape", "%")
  .csv("x1")

1,aaa
2,$b,c$
3,$d%$e$
4,$f%g$

Finally, note that with escape set to %, there is now a standalone escape character in f%g; if you would like to escape that as well, use charToEscapeQuoteEscaping.

df.coalesce(1).write.mode("overwrite")
  .option("quote", "$")
  .option("escape", "%")
  .option("quoteAll", true)
  .option("charToEscapeQuoteEscaping", "^")
  .csv("x1")

$1$,$aaa$
$2$,$b,c$
$3$,$d%$e$
$4$,$f^%g$

No, you can't go deeper and escape ^...
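As a side note, the same quote-related options exist on the read side, so to get the original values back you need to read with matching settings. A minimal sketch, assuming you want to read back the last file written to x1:

val back = spark.read
  .option("quote", "$")
  .option("escape", "%")
  .option("charToEscapeQuoteEscaping", "^")
  .csv("x1")
back.show

This should recover b,c, d$e and f%g as single values (everything comes back as strings unless you add a schema or inferSchema).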

Inandin answered 14/11, 2022 at 12:14. Comments (4):
Extraordinary explanation... (Bortz)
Is the below summary accurate? quote - enclose a string that contains the delimiter, i.e. the comma in a CSV; escape - when the quote character is part of a string, it is escaped with the escape character; escapeQuotes - when the quote character is part of a string it is normally quoted and escaped, and escapeQuotes = false is used to skip that; quoteAll - quote all fields irrespective of whether they contain a delimiter or other special characters; charToEscapeQuoteEscaping - when the escape character is part of a string, it is escaped with charToEscapeQuoteEscaping; quoteAll = true overrides escapeQuotes. (Bortz)
Also, does quote work just on the delimiter, or also on other special characters such as \r, \n, \000, \001, etc.? (Bortz)
Yes, that's quite accurate. A value containing \n is quoted, because that's the record delimiter. Anything other than the quote character, the record delimiter, and the field delimiter is not. (Celeski)
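To illustrate that last comment (an extra sketch, not from the thread; the path x2 is arbitrary): a value containing the record delimiter \n is quoted even with all defaults left in place.

val df2 = Seq((1, "a\nb"), (2, "cd")).toDF("x", "y")
df2.coalesce(1).write.mode("overwrite").csv("x2")

In the output file, the a\nb value ends up wrapped in double quotes (spanning two physical lines), while cd is written unquoted.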
