Spark - CSV - Write Options - Quotes
Hope everyone is doing well.

While going through the Spark CSV data source options for the linked question, I got quite confused about the differences between the various quote-related options available.

(screenshot: spark-csv quote options)

  1. Is there a detailed explanation of the differences between them?
  2. Does any option override the others, or do they all work together?

I used the example mentioned in the linked question to understand the differences, but I am still a little confused. Thank you for all the help.

Bortz asked 14/11, 2022 at 8:07

Let's run through a few examples.

val df = Seq(
  (1, "aaa"),
  (2, "b,c"),
  (3, "d$e"),
  (4, "f%g")
).toDF("x", "y")
df.show

+---+---+
|  x|  y|
+---+---+
|  1|aaa|
|  2|b,c|
|  3|d$e|
|  4|f%g|
+---+---+

Whenever a value contains commas, they would be indistinguishable from field delimiters (as in 2,b,c), so Spark can quote such values. The quote character is double quotes by default (as in 2,"b,c"), but you can customize it using the quote option.
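For reference, this is roughly what the default behaviour looks like (a quick sketch; the output directory x is just an example path):

df.coalesce(1).write.mode("overwrite")
  .csv("x")

1,aaa
2,"b,c"
3,d$e
4,f%g

Switching to a custom quote character: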

df.coalesce(1).write.mode("overwrite")
  .option("quote", "$")
  .csv("x")

1,aaa
2,$b,c$
3,$d\$e$
4,f%g

We used the dollar sign here. Not only was b,c quoted; since d$e contained a dollar, it was quoted as well, and the dollar itself was escaped.

I don't know why you would want to do that, but you can ask Spark not to escape fields containing the quote character, using the escapeQuotes option.

df.coalesce(1).write.mode("overwrite")
  .option("quote", "$")
  .option("escapeQuotes", false)
  .csv("x")

1,aaa
2,$b,c$
3,d$e
4,f%g

Here, d$e remained unquoted.

You can also force it to quote every field using quoteAll.

df.coalesce(1).write.mode("overwrite")
  .option("quote", "$")
  .option("quoteAll", true)
  .csv("x1")

$1$,$aaa$
$2$,$b,c$
$3$,$d\$e$
$4$,$f%g$

Note that quoteAll = true makes escapeQuotes irrelevant.
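You can check that by setting both options (a sketch reusing the same df; the path x1 is just an example):

df.coalesce(1).write.mode("overwrite")
  .option("quote", "$")
  .option("quoteAll", true)
  .option("escapeQuotes", false)
  .csv("x1")

Per the note above, the output should match the quoteAll example: every field is quoted and the dollar in d$e is still escaped.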

You can customize the character used to escape the quote character inside quoted values using the escape option, if you don't like the default backslash.

df.coalesce(1).write.mode("overwrite")
  .option("quote", "$")
  .option("escape", "%")
  .csv("x1")

1,aaa
2,$b,c$
3,$d%$e$
4,$f%g$

Finally, note that with escape set to %, there is now a standalone escape character in f%g; if you would like to escape that as well, use charToEscapeQuoteEscaping.

df.coalesce(1).write.mode("overwrite")
  .option("quote", "$")
  .option("escape", "%")
  .option("quoteAll", true)
  .option("charToEscapeQuoteEscaping", "^")
  .csv("x1")

$1$,$aaa$
$2$,$b,c$
$3$,$d%$e$
$4$,$f^%g$

No, you can't go deeper and escape ^...
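As a side note, the same quote-related options exist on the read side, so to get the original values back you need to read with matching settings. A minimal sketch, assuming you want to read back the last file written to x1:

val back = spark.read
  .option("quote", "$")
  .option("escape", "%")
  .option("charToEscapeQuoteEscaping", "^")
  .csv("x1")
back.show

This should recover b,c, d$e and f%g as single values (everything comes back as strings unless you add a schema or inferSchema).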

Inandin answered 14/11, 2022 at 12:14. Comments (4):
Extraordinary explanation... (Bortz)
Is the below summary accurate? quote - enclose a string that contains the delimiter, i.e. the comma in a CSV; escape - when the quote character is part of a string, it is escaped with the escape character; escapeQuotes - when the quote character is part of a string it is normally quoted and escaped, and escapeQuotes = false is used to skip that; quoteAll - quote all fields irrespective of whether they contain a delimiter or other special characters; charToEscapeQuoteEscaping - when the escape character is part of a string, it is escaped with charToEscapeQuoteEscaping; quoteAll = true overrides escapeQuotes. (Bortz)
Also, does quote work just on the delimiter, or also on other special characters such as \r, \n, \000, \001, etc.? (Bortz)
Yes, that's quite accurate. A value containing \n is quoted, because that's the record delimiter. Anything other than the quote character, the record delimiter, and the field delimiter is not. (Celeski)
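To illustrate that last comment (an extra sketch, not from the thread; the path x2 is arbitrary): a value containing the record delimiter \n is quoted even with all defaults left in place.

val df2 = Seq((1, "a\nb"), (2, "cd")).toDF("x", "y")
df2.coalesce(1).write.mode("overwrite").csv("x2")

In the output file, the a\nb value ends up wrapped in double quotes (spanning two physical lines), while cd is written unquoted.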
